The psychological construct of self-regulation, along with related constructs such as impulsivity, self-control and inhibition, correlates with problematic real-world behaviors such as disordered eating (Mobbs et al., 2010; Verbeken et al., 2009; Nederkoorn et al., 2006a; Nederkoorn et al., 2006b), gambling (Lawrence et al., 2009; Fuentes et al., 2006; Alessi and Petry, 2003), drug addiction (Coffey et al., 2003; de Wit, Crean and John, 2000; Sher, Bartholow and Wood, 2000; Kirby, Petry and Bickel, 1999) and poor financial decisions (Meier and Sprenger, 2015; Meier and Sprenger, 2013; Meier and Sprenger, 2012). In these lines of work, measures of self-regulation are used as behavioral assays for individual difference analyses. Treating behavioral measures as reflective of person-specific characteristics assumes that measures of self-regulation are stable across time; in other words, that they have high test-retest reliability (high between- and low within-subject variability). This assumption has not been tested extensively or equally across all measures of self-regulation in the literature.
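The notion of high between- and low within-subject variability can be illustrated with a rough variance-ratio computation. This is a hedged sketch with simulated data; all names and parameter values are hypothetical, not from the study.

```r
# Simulated example: a reliable measure has large between-subject variance
# relative to within-subject (session-to-session) variance.
set.seed(1)
n <- 200
trait <- rnorm(n)                      # stable person-specific component
t1 <- trait + rnorm(n, sd = 0.5)       # session 1 score with measurement noise
t2 <- trait + rnorm(n, sd = 0.5)       # session 2 score with measurement noise

between_var <- var(rowMeans(cbind(t1, t2)))   # person-to-person spread
within_var  <- mean((t1 - t2)^2) / 2          # session-to-session spread
reliability <- between_var / (between_var + within_var)  # rough variance-ratio estimate
```

With the simulated noise level above the estimate lands near the true reliability of 0.8; shrinking the noise pushes it toward 1.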
Self-regulation measures fall into two broad categories: surveys and cognitive tasks. While the former typically undergo some form of psychometric testing, often filtering out items with low test-retest reliability, this is less frequently true for the latter. Although empirical test-retest reliabilities for certain cognitive task measures are reported in pockets of the literature, an exhaustive summary, or a systematic elimination of tasks or measures based on stability across time, does not seem to exist. Test-retest reliability is, however, crucial for individual difference analyses (Hedge, Powell and Sumner, 2017).
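Why reliability matters for individual difference analyses can be seen in a toy simulation (hypothetical parameters): by the classical attenuation formula, the observable correlation between two measures is bounded by r_true * sqrt(rel_x * rel_y), so unreliable measures cap the correlations any individual difference study can detect.

```r
# Simulated example: even when two measures tap the same latent trait,
# measurement noise attenuates the correlation we can observe.
set.seed(2)
n <- 10000
trait <- rnorm(n)
x <- trait + rnorm(n, sd = 2)   # reliability = 1 / (1 + 4) = 0.2
y <- trait + rnorm(n, sd = 2)   # reliability = 0.2
cor(x, y)                       # approx. sqrt(0.2 * 0.2) = 0.2
```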
To answer the question of the reliability of self-regulation measures comprehensively, we created a large battery consisting of both cognitive task and survey measures. In this paper we make two contributions. First, we provide a review of these measures along with their reported reliabilities, when available. Second, we report the results of a large study in which we collected retest reliability data using all of these measures. This effort not only fills a notable gap in the literature by clarifying which measures of self-regulation are stable across time, but also provides guidance on factors to consider when constructing new tasks for individual difference measures.
This sample consists of data from 150 subjects, drawn from the original sample of 522 who completed the initial battery of 37 cognitive tasks, 23 surveys and 3 demographic surveys. Details of the original sample and the quality control (QC) procedures are described elsewhere (Eisenberg et al., 2017). Invited participants were chosen randomly, and only subsets of them were invited for a given batch (instead of opening the battery to all qualified subjects), to avoid oversampling of, and bias towards, "high self-regulators".
workers = read.csv(paste0(retest_data_path,'Local/User_717570_workers.csv'))
# Flag workers marked as retest workers in any of the five batches
workers = workers %>%
group_by(Worker.ID) %>%
mutate(Retest_worker=ifelse(sum(CURRENT.RetestWorker,CURRENT.RetestWorkerB2,CURRENT.RetestWorkerB3,CURRENT.RetestWorkerB4,CURRENT.RetestWorkerB5,na.rm=T)>0,1,0)) %>%
ungroup()
worker_counts <- fromJSON(paste0(retest_data_path,'/Local/retest_worker_counts.json'))
worker_counts = as.data.frame(unlist(worker_counts))
names(worker_counts) = "task_count"
In total 242 participants were invited; 175 began the battery, 157 completed it, and 150 provided data that passed QC at both time points. Our target sample size was determined in advance, and data collection continued until this number of participants with data surviving QC was reached.
disc_comp_date = read.csv(paste0(retest_data_path,'Local/discovery_completion_dates.csv'), header=FALSE)
val_comp_date = read.csv(paste0(retest_data_path,'Local/validation_completion_dates.csv'), header=FALSE)
test_comp_date = rbind(disc_comp_date, val_comp_date)
rm(disc_comp_date, val_comp_date)
retest_comp_date = read.csv(paste0(retest_data_path,'Local/retest_completion_dates.csv'), header=FALSE)
comp_dates = merge(retest_comp_date, test_comp_date, by="V1")
names(comp_dates) <- c("sub_id", "retest_comp", "test_comp")
comp_dates$retest_comp = as.Date(comp_dates$retest_comp)
comp_dates$test_comp = as.Date(comp_dates$test_comp)
comp_dates$days_btw = with(comp_dates, retest_comp-test_comp)
Retest data collection took place on average 115 days after the completion of the initial battery, with a range of 60 to 228 days.
rm(test_comp_date, retest_comp_date, comp_dates)
test_demog <- read.csv(paste0(retest_data_path, '/t1_data/demographic_health.csv'))
retest_demog <- read.csv(paste0(retest_data_path, 'demographic_health.csv'))
retest_demog = retest_demog[retest_demog$X %in% test_demog$X,]
names(test_demog)[which(names(test_demog) == 'X')] <-'sub_id'
names(retest_demog)[which(names(retest_demog) == 'X')] <-'sub_id'
summary(test_demog %>%
select(Sex, Age))
Sex Age
Min. :0.00 Min. :21.0
1st Qu.:0.00 1st Qu.:28.0
Median :1.00 Median :33.0
Mean :0.52 Mean :34.1
3rd Qu.:1.00 3rd Qu.:38.8
Max. :1.00 Max. :59.0
summary(retest_demog %>%
select(Sex, Age))
Sex Age
Min. :0.000 Min. :21.0
1st Qu.:0.000 1st Qu.:29.0
Median :1.000 Median :33.0
Mean :0.527 Mean :34.5
3rd Qu.:1.000 3rd Qu.:38.8
Max. :1.000 Max. :60.0
One of the major contributions of this project is a comprehensive literature review of the retest reliabilities of the surveys and tasks we used. We reviewed the literature at the measure (as opposed to task) level, paying attention to differences in sample size, the delay between the two measurements, and the statistic used to assess reliability (e.g. Spearman vs. Pearson correlations). Here we present a table and a visualization summarizing our findings. References mentioned in the table below can be found here.
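As a reminder of why the statistic matters, the two coefficients can diverge sharply. A toy example with made-up numbers:

```r
# Toy example: a single extreme observation can even flip the sign of the
# Pearson correlation, while the rank-based Spearman coefficient is robust.
x <- c(1, 2, 3, 4, 5, 100)   # one extreme score
y <- c(1, 2, 3, 4, 5, 0)
cor(x, y, method = "pearson")    # negative: dominated by the outlier
cor(x, y, method = "spearman")   # positive: based only on ranks
```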
lit_review <- read.csv('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/input/lit_review_figure.csv')
lit_review
lit_review = lit_review %>%
separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
mutate(task_group = factor(task_group, levels = task_group[order(task)]),
type = as.character(type)) %>%
mutate(task_group = gsub("_", " ", task_group),
var = gsub("_", " ", var)) %>%
arrange(task_group, raw_fit, var) %>%
mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
mutate(task_group = gsub("survey", "", task_group)) %>%
select(-measure_description)
Warning in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels)
else paste0(labels, : duplicated levels in factors are deprecated
tmp = lit_review[duplicated(lit_review$reference)==FALSE,]
nrow(tmp)
[1] 154
sum(tmp$sample_size)
[1] 17550
nrow(lit_review)
[1] 583
rm(tmp)
Measure level plot
lit_review = lit_review %>%
select(-reference)
Because this plot is difficult to digest, we summarize it at the task level to give a general sense of the main takeaways. This summary naturally disregards much of the fine-grained information.
p1_t_legend <- lit_review %>%
filter(task == 'task') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) +
geom_point(aes(size=sample_size, shape = type), color='black')+
theme(axis.text.y = element_text(size=43),
legend.position = 'bottom',
axis.text.x = element_text(size=23),
axis.title.x = element_text(size=30),
legend.text = element_text(size=20),
legend.key.width = unit(0.75, "inches"),
legend.title = element_text(size=28),
legend.spacing.x = unit(0.5, "inches")) +
guides(size = guide_legend(override.aes = list(size=c(9,18,28))),
shape = guide_legend(override.aes = list(size=18)))+
xlab("Retest Reliability")+
ylab("")+
scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3), name="Type")+
scale_size_continuous(name = "Sample Size")+
geom_vline(xintercept = 0, color = "red", size = 1)
p1_t <- lit_review %>%
filter(task == 'task') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) +
geom_point(aes(size=sample_size, shape = type), color='#00BFC4')+
theme(axis.text.y = element_text(size=43),
legend.position = 'none',
axis.text.x = element_text(size=23),
axis.title.x = element_text(size=30)) +
xlab("Retest Reliability")+
ylab("")+
scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3))+
scale_size_continuous(range = c(5, 35))+
geom_vline(xintercept = 0, color = "red", size = 1)
p2_t <- lit_review %>%
filter(task == 'survey') %>%
ggplot(aes(y = factor(task_group, levels=rev(unique(task_group))), x = retest_reliability)) +
geom_point(aes(size=sample_size, shape = type), color='#F8766D')+
theme(axis.text.y = element_text(size=43),
legend.position = 'none',
axis.text.x = element_text(size=23),
axis.title.x = element_text(size=30)) +
xlab("Retest Reliability")+
ylab("")+
scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
scale_shape_manual(breaks = sort(lit_review$type), values = c(15, 16, 17, 3))+
scale_size_continuous(range = c(5, 35))+
geom_vline(xintercept = 0, color = "red", size = 1)
mylegend<-g_legend(p1_t_legend)
p3_t <- arrangeGrob(arrangeGrob(p1_t, p2_t, nrow=1), mylegend, nrow=2,heights=c(10, 1))
ggsave('Lit_Review_Plot_t.jpg', plot = p3_t, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 27, height = 48, units = "in", limitsize = FALSE, dpi = 72)
rm(p1_t, p2_t, p3_t, mylegend, p1_t_legend)
An interactive version of this plot can be found here
Takeaways from this review are:
- Survey measures have been reported to have higher reliability than task measures
- Survey measures have less variability in their reported reliability estimates than task measures
The variables included in this report are:
- meaningful variables (includes only HDDM parameters)
- EZ diffusion parameters
- Raw RT and Accuracy measures
- Variables found in the literature (for comparison)
test_data <- read.csv(paste0(retest_data_path,'t1_data/variables_exhaustive.csv'))
test_data <- test_data[,names(test_data) %in% retest_report_vars]
test_data$X <- as.character(test_data$X)
names(test_data)[which(names(test_data) == 'X')] <-'sub_id'
For reference, here are the variables that are not included in the remaining analyses of this report because they have not been of theoretical interest in factor structure analyses of these data so far. These include drift diffusion and other model parameters for specific conditions within a task; survey variables that are not part of the dependent variables for that survey in the literature; and demographics (which are saved for prediction analyses).
test_data2 <- read.csv(paste0(retest_data_path, 't1_data/variables_exhaustive.csv'))
df <- data.frame(names(test_data2)[which(names(test_data2) %in% names(test_data) == FALSE)])
names(df) = c('vars')
df
retest_data <- read.csv(paste0(retest_data_path,'variables_exhaustive.csv'))
retest_data <- retest_data[,names(retest_data) %in% retest_report_vars]
retest_data$X <- as.character(retest_data$X)
names(retest_data)[which(names(retest_data) == 'X')] <-'sub_id'
retest_data = retest_data[retest_data$sub_id %in% test_data$sub_id,]
Since HDDM parameters depend on the sample on which they are fit, we refit the model on t1 data for the subjects who have t2 data. Here we replace the HDDM parameters in the current t1 dataset with these refitted values.
hddm_refits <- read.csv(paste0(retest_data_path,'t1_data/hddm_refits_exhaustive.csv'))
hddm_refits = hddm_refits[,names(hddm_refits) %in% retest_report_vars]
hddm_refits$X <- as.character(hddm_refits$X)
names(hddm_refits)[which(names(hddm_refits) == 'X')] <-'sub_id'
#For later comparison of whether fitting the DDM parameters on full or retest sample makes a big difference
test_data_full_sample_hddm <- test_data
#drop hddm columns from test_data
test_data = cbind(test_data$sub_id, test_data[,names(test_data) %in% names(hddm_refits) == FALSE])
#fix naming before merging
names(test_data)[which(names(test_data) == 'test_data$sub_id')] <-'sub_id'
#merge hddm refits to test data
test_data = merge(test_data, hddm_refits, by="sub_id")
Point estimates of reliability for the demographic variables.
# Identify the numeric demographic columns to compute reliability for
numeric_cols = c()
for(i in 1:length(names(test_demog))){
if(is.numeric(test_demog[,i])){
numeric_cols <- c(numeric_cols, names(test_demog)[i])
}
}
demog_rel_df <- data.frame(spearman = rep(NA, length(numeric_cols)),
icc = rep(NA, length(numeric_cols)),
pearson = rep(NA, length(numeric_cols)))
row.names(demog_rel_df) <- numeric_cols
for(i in 1:length(numeric_cols)){
demog_rel_df[numeric_cols[i], 'spearman'] <- get_spearman(numeric_cols[i], t1_df = test_demog, t2_df = retest_demog)
demog_rel_df[numeric_cols[i], 'icc'] <- get_icc(numeric_cols[i], t1_df = test_demog, t2_df = retest_demog)
demog_rel_df[numeric_cols[i], 'pearson'] <- get_pearson(numeric_cols[i], t1_df = test_demog, t2_df = retest_demog)
}
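The get_spearman(), get_icc(), and get_pearson() helpers are defined elsewhere in the project. As a hedged sketch (not the project's actual implementation), such a helper might align the two sessions by subject ID before correlating:

```r
# Hypothetical sketch of a reliability helper: merge the t1 and t2 data
# frames on sub_id, then correlate the matched columns for one variable.
get_cor_sketch <- function(var, t1_df, t2_df, method = "pearson") {
  merged <- merge(t1_df[, c("sub_id", var)], t2_df[, c("sub_id", var)],
                  by = "sub_id", suffixes = c("_t1", "_t2"))
  cor(merged[[paste0(var, "_t1")]], merged[[paste0(var, "_t2")]],
      method = method, use = "pairwise.complete.obs")
}
```

Merging by sub_id matters because the two sessions' data frames need not list subjects in the same order.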
demog_rel_df = demog_rel_df %>%
mutate(var = row.names(.)) %>%
select(var, icc, spearman, pearson) %>%
arrange(-icc)
demog_rel_df
sjt.df(demog_rel_df %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/demog_rel_table.doc")
| var | icc | spearman | pearson |
|---|---|---|---|
| HispanicLatino | 1 | 1 | 1 |
| Age | 0.998 | 0.996 | 0.997 |
| Sex | 0.993 | 0.987 | 0.987 |
| HouseholdIncome | 0.988 | 0.955 | 0.977 |
| ArrestedChargedLifeCount | 0.958 | 0.829 | 0.92 |
| HighestEducation | 0.951 | 0.895 | 0.908 |
| WeightPounds | 0.951 | 0.936 | 0.907 |
| ChildrenNumber | 0.95 | 0.971 | 0.907 |
| HowLongSmoked | 0.949 | 0.877 | 0.904 |
| RelationshipNumber | 0.948 | 0.9 | 0.901 |
| CigsPerDay | 0.945 | 0.841 | 0.895 |
| LifetimeSmoke100Cigs | 0.944 | 0.894 | 0.894 |
| AlcoholHowManyDrinksDay | 0.936 | 0.871 | 0.881 |
| MortgageDebt | 0.93 | 0.833 | 0.87 |
| RetirementPercentStocks | 0.928 | 0.863 | 0.867 |
| LongestRelationship | 0.928 | 0.853 | 0.867 |
| HowOftenFailedActivitiesDrinking | 0.927 | 0.782 | 0.866 |
| CoffeeCupsPerDay | 0.926 | 0.876 | 0.863 |
| RentOwn | 0.926 | 0.862 | 0.862 |
| HowOftenMemoryConcentrationProblemCannabis | 0.925 | 0.594 | 0.863 |
| SmokeEveryDay | 0.916 | 0.802 | 0.846 |
| HeightInches | 0.907 | 0.888 | 0.83 |
| DivorceCount | 0.905 | 0.935 | 0.836 |
| CannabisHowOften | 0.903 | 0.856 | 0.823 |
| EducationDebt | 0.901 | 0.837 | 0.821 |
| HowOftenDevotedTimeCannabis | 0.898 | 0.814 | 0.814 |
| MedicalProblemsDueToDrugUse | 0.888 | 0.814 | 0.814 |
| AlcoholHowOften | 0.881 | 0.788 | 0.79 |
| RelationshipStatus | 0.87 | 0.788 | 0.77 |
| HowSoonSmokeAfterWaking | 0.868 | 0.787 | 0.767 |
| RetirementAccount | 0.86 | 0.756 | 0.756 |
| AlcoholHowOften6Drinks | 0.853 | 0.766 | 0.747 |
| CreditCardDebt | 0.846 | 0.808 | 0.733 |
| RelativeFriendConcernedDrinking | 0.844 | 0.71 | 0.733 |
| CannabisPast6Months | 0.84 | 0.725 | 0.725 |
| TeaCupsPerDay | 0.834 | 0.633 | 0.715 |
| HowOftenCantStopDrinking | 0.833 | 0.704 | 0.716 |
| CaffienatedSodaCansPerDay | 0.832 | 0.711 | 0.723 |
| HowOftenGuiltRemorseDrinking | 0.829 | 0.564 | 0.708 |
| CannabisHoursStoned | 0.825 | 0.698 | 0.703 |
| DoctorVisitsLastMonth | 0.815 | 0.351 | 0.701 |
| HowOftenDrinkMorning | 0.809 | 0.39 | 0.692 |
| CannabisConsideredReduction | 0.806 | 0.638 | 0.675 |
| NeurologicalDiagnoses | 0.795 | 0.66 | 0.66 |
| HowOftenUnableRememberDrinking | 0.786 | 0.61 | 0.651 |
| FeelBadGuiltyDrugUse | 0.782 | 0.642 | 0.642 |
| TrafficAccidentsLifeCount | 0.781 | 0.585 | 0.643 |
| HowOftenCantStopCannabis | 0.748 | 0.732 | 0.598 |
| InjuredDrinking | 0.734 | 0.626 | 0.592 |
| CarDebt | 0.716 | 0.698 | 0.559 |
| DaysHalfLastMonth | 0.715 | 0.552 | 0.559 |
| Worthless | 0.709 | 0.632 | 0.549 |
| DaysPhysicalHealthFeelings | 0.705 | 0.533 | 0.545 |
| Depressed | 0.689 | 0.561 | 0.527 |
| EverythingIsEffort | 0.682 | 0.553 | 0.517 |
| Hopeless | 0.678 | 0.579 | 0.516 |
| OtherDrugs | 0.658 | 0.492 | 0.492 |
| RestlessFidgety | 0.621 | 0.507 | 0.451 |
| TrafficTicketsLastYearCount | 0.621 | 0.438 | 0.45 |
| Nervous | 0.615 | 0.473 | 0.445 |
| NeglectedFamilyDrugUse | 0.488 | 0.342 | 0.342 |
| DaysLostLastMonth | 0.475 | 0.604 | 0.312 |
| EngagedInIllegalActsToObtainDrugs | 0.468 | 0.306 | 0.306 |
| AbleToStopDrugs | 0.458 | 0.306 | 0.306 |
| BlackoutFlashbackDrugUse | 0.424 | 0.272 | 0.272 |
| HowOftenFailedActivitiesCannabis | 0.423 | 0.636 | 0.378 |
| WidthdrawalSymptoms | 0.403 | 0.254 | 0.254 |
| Last30DaysUsual | 0.395 | 0.285 | 0.248 |
| GamblingProblem | 0.317 | 0.38 | 0.195 |
| AbuseMoreThanOneDrugAtATime | 0.294 | 0.173 | 0.173 |
| HowOftenHazardousCannabis | 0.057 | 0.369 | 0.029 |
| CaffieneOtherSourcesDayMG | 0 | 0.363 | -0.033 |
| SpouseParentsComplainDrugUse | -0.033 | -0.017 | -0.017 |
| OtherDebtAmount | -0.083 | 0.005 | -0.041 |
summary(demog_rel_df)
var icc spearman pearson
Length:74 Min. :-0.0831 Min. :-0.0166 Min. :-0.0406
Class :character 1st Qu.: 0.6793 1st Qu.: 0.5373 1st Qu.: 0.5161
Mode :character Median : 0.8323 Median : 0.7006 Median : 0.7156
Mean : 0.7417 Mean : 0.6558 Mean : 0.6446
3rd Qu.: 0.9257 3rd Qu.: 0.8401 3rd Qu.: 0.8628
Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
Due to our data collection strategy we did not strictly control the delay between the two measurements, as standard psychometric studies of retest reliability might.
An individual difference measure would ideally remain stable regardless of the delay between measurements.
Since we only had two measurements, we could not directly test whether a measure becomes less reliable with increasing delay: reliability is a measure-level metric while days between completion is a subject-level one, so the average delay between measurements would be the same for all measures.
Consequently, regressing the average difference score for each measure on the average delay between measurements amounts to regressing a varying vector on a vector of identical values; it is equivalent to a t-test asking whether the mean of the average difference scores differs from the single delay value. (In theory the delay column would contain exactly one unique value; in practice, because the dv could not be calculated for some subjects in some measures, there may be more than one unique average delay.) This analysis is not meaningful.
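A toy demonstration of this point (made-up numbers): with a constant predictor, lm() cannot estimate a slope at all.

```r
# Regressing measure-level average differences on a constant average delay:
# the predictor has no variance, so its coefficient is inestimable (NA).
avg_difference <- c(0.10, -0.20, 0.05, 0.30, -0.10)  # varies across measures
avg_delay <- rep(115, 5)                             # identical for every measure
fit <- lm(avg_difference ~ avg_delay)
is.na(coef(fit)["avg_delay"])   # TRUE: the constant column is aliased
```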
Instead of using a summary metric like the retest reliability estimate for each measure, we can check whether the difference score distribution for each measure depends on the delay between the two measurements. Since the difference score distribution is at the subject level, we can check whether the order of subjects in this distribution depends on their order in the distribution of days between completing the two tests.
We make a data frame with the difference between the two scores for each measure for each subject. Since the scores for different measures are on different scales, the difference scores are normalized for comparability (i.e. demeaned and divided by the standard deviation of the difference score distribution). Note that the variance of the difference score distribution accounts for the variance at both time points by summing them. Normalization sets the mean of each difference score distribution to 0, which would mask any meaningful change between the two time points; however, the analysis here does not interpret the mean of the difference score distributions but only its relation to the days between completion. We check whether the variables show systematic differences between the two time points later.
Here we check whether the difference is larger the longer the delay.
numeric_cols = c()
for(i in 1:length(names(test_data))){
if(is.numeric(test_data[,i]) & names(test_data)[i] %in% names(retest_data)){
numeric_cols <- c(numeric_cols, names(test_data)[i])
}
}
t1_2_difference = data.frame()
for(i in 1:length(numeric_cols)){
tmp = match_t1_t2(numeric_cols[i],format='wide')
tmp = tmp %>%
mutate(difference = scale(`2` - `1`))
t1_2_difference = rbind(t1_2_difference, tmp)
}
t1_2_difference$difference = as.data.frame(t1_2_difference$difference)$V1
t1_2_difference = t1_2_difference %>% separate(dv, c("task", "dv2"), sep="\\.", remove=FALSE)
Add completion dates to this data frame.
retest_task_comp_times = read.csv(paste0(retest_data_path, 'Local/retest_task_completion_times.csv'))
test_task_comp_times = read.csv(paste0(retest_data_path, 'Local/test_task_completion_times.csv'))
task_comp_times = merge(retest_task_comp_times, test_task_comp_times, by=c('worker_id','task'))
rm(retest_task_comp_times, test_task_comp_times)
task_comp_times = task_comp_times %>%
select(-X.x, -X.y) %>%
mutate(finish_day.x = as.Date(finish_day.x),
finish_day.y = as.Date(finish_day.y),
days_btw = finish_day.x-finish_day.y) %>%
rename(sub_id=worker_id)
t1_2_difference = merge(t1_2_difference, task_comp_times[,c('sub_id', 'task','days_btw')], by=c('sub_id', 'task'))
What does the distribution of differences look like? Below are the distributions of differences between the two time points for each measure.
t1_2_difference %>%
ggplot(aes(difference, alpha=dv))+
geom_histogram(position='identity')+
theme(legend.position = 'none')
What do the difference score distributions look like with respect to the days between completion?
t1_2_difference %>%
ggplot()+
geom_smooth(aes(as.numeric(days_btw), abs(difference), group=factor(dv)), method='lm', se=FALSE)+
geom_smooth(aes(as.numeric(days_btw), abs(difference)), method='lm', color = "black", se=FALSE)+
theme(legend.title = element_blank())+
xlab('Days between completion')+
ylab('Absolute Scaled difference score')
To test whether the slope of the black line is significant, we run a mixed effects model with a fixed effect for days between completion, a random slope for days between for each dv, and a random intercept for each dv.
Previously subjects were used as a random effect, but the days between the two time points for each measure depend on subject ID. What varies randomly is which dv's distribution of differences we are relating to the days between the time points. So the model was changed to have a fixed effect for days between, a random slope for days between per dependent variable (dependent variables can be differentially sensitive to the effect of delay), and a random intercept for each dependent variable.
A significant fixed effect suggests that, on average, the longer the delay the smaller the difference.
summary(lmerTest::lmer(abs(difference) ~ scale(days_btw)+(scale(days_btw) | dv), data=t1_2_difference))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: abs(difference) ~ scale(days_btw) + (scale(days_btw) | dv)
Data: t1_2_difference
REML criterion at convergence: 132273
Scaled residuals:
Min 1Q Median 3Q Max
-1.127 -0.719 -0.260 0.402 15.807
Random effects:
Groups Name Variance Std.Dev. Corr
dv (Intercept) 0.003190 0.0565
scale(days_btw) 0.000149 0.0122 1.00
Residual 0.468366 0.6844
Number of obs: 63454, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.72218 0.00382 457.00000 189.05 <2e-16 ***
scale(days_btw) -0.00985 0.00278 3236.00000 -3.55 0.0004 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
scl(dys_bt) 0.146
A single mixed effects model, however, gives no sense of the simple effects (for how many variables the effect of days between is significant, and in which direction). We can run the regression separately for each dv to see whether all difference score distributions are affected the same way by the days between completion.
get_delay_effect = function(df){
mod = lm(abs(difference) ~ scale(days_btw), data = df)
out = data.frame(estimate=coef(summary(mod))["scale(days_btw)","Estimate"], pval=coef(summary(mod))["scale(days_btw)","Pr(>|t|)"])
return(out)
}
sig_days_effect = t1_2_difference %>%
group_by(dv) %>%
do(get_delay_effect(.)) %>%
filter(pval<0.05)
sig_days_effect
For visualization we previously summarized the difference scores per person by taking the average difference per task per subject and plotting that against the number of days between completion. This yields multiple points for each value on the x-axis, representing each subject's average difference per task. But variables within a task can move in different directions (e.g. if the patient proportion increases, the discount rate decreases), so this does not seem like a good idea.
Instead, we now plot all the data grouped by dependent variable and colored by whether the difference score distribution for that variable has a significant slope when regressed on days between. The black line is the fixed effect from the large multilevel model (this is the same plot as above, colored by the significance of each dv's slope).
t1_2_difference %>%
mutate(sig_days_effect = ifelse(dv %in% sig_days_effect$dv, 1, 0))%>%
arrange(-sig_days_effect) %>%
ggplot()+
stat_smooth (aes(as.numeric(days_btw), abs(difference),
group=factor(dv, levels=unique(dv[order(sig_days_effect)]), ordered=TRUE),
color=factor(sig_days_effect, levels = c(1,0))),geom="line", alpha=0.5, method='lm')+
geom_smooth(aes(as.numeric(days_btw), abs(difference)), method='lm', color = "black", se=FALSE)+
theme(legend.title = element_blank())+
xlab('Days between completion')+
ylab('Scaled difference score')+
scale_color_discrete(breaks=c(0,1),
labels=c("NS", "Sig"))
ggsave('DaysBtwEffect.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 6, height = 4, units = "in", limitsize = FALSE, dpi = 100)
Conclusion: The average difference between the scores of a measure, if anything, decreases with increased delay. This fixed effect masks the simple effect for each dv. Examined individually, only 10 variables show a dependence on days between completion; 7 of them show decreasing differences with increasing delay, while 3 show increasing differences. This suggests that the lack of stricter control over the days between completion does not have a large effect on the stability of most measures.
Do the variables that show a significant dependence on the delay between completion times have systematically lower or higher reliability? Below we plot the reliability point estimate against the size of the delay effect. There are very few variables to compare, but regardless the result does not look conclusive or concerning.
#Create df of point estimate reliabilities
rel_df <- data.frame(spearman = rep(NA, length(numeric_cols)),
icc = rep(NA, length(numeric_cols)),
pearson = rep(NA, length(numeric_cols)),
partial_eta_sq = rep(NA, length(numeric_cols)),
sem = rep(NA, length(numeric_cols)),
var_subs = rep(NA, length(numeric_cols)),
var_ind = rep(NA, length(numeric_cols)),
var_resid = rep(NA, length(numeric_cols)))
row.names(rel_df) <- numeric_cols
for(i in 1:length(numeric_cols)){
rel_df[numeric_cols[i], 'spearman'] <- get_spearman(numeric_cols[i])
rel_df[numeric_cols[i], 'icc'] <- get_icc(numeric_cols[i])
rel_df[numeric_cols[i], 'pearson'] <- get_pearson(numeric_cols[i])
rel_df[numeric_cols[i], 'partial_eta_sq'] <- get_partial_eta(numeric_cols[i])
rel_df[numeric_cols[i], 'sem'] <- get_sem(numeric_cols[i])
rel_df[numeric_cols[i], 'var_subs'] <- get_var_breakdown(numeric_cols[i])$subs
rel_df[numeric_cols[i], 'var_ind'] <- get_var_breakdown(numeric_cols[i])$ind
rel_df[numeric_cols[i], 'var_resid'] <- get_var_breakdown(numeric_cols[i])$resid
}
rel_df$dv = row.names(rel_df)
row.names(rel_df) = seq(1:nrow(rel_df))
rel_df$task = 'task'
rel_df[grep('survey', rel_df$dv), 'task'] = 'survey'
rel_df[grep('holt', rel_df$dv), 'task'] = "task"
rel_df = rel_df %>%
select(dv, task, spearman, icc, pearson, partial_eta_sq, sem, var_subs, var_ind, var_resid)
# row.names(rel_df) = NULL
sig_days_effect %>%
left_join(rel_df, by='dv') %>%
ggplot(aes(estimate, icc))+
geom_point()
Some surveys have overlapping questions. Do these correlate within and across sessions?
First determine the overlapping questions.
tmp = read.csv(gzfile(paste0(retest_data_path, 'items.csv.gz')))
tmp = tmp %>%
filter(worker == 's005') %>%
select(item_ID, item_text) %>%
mutate(item_text = trimws(as.character(item_text))) %>%
unite(item, c("item_ID", "item_text"), sep = "___")
comb = as.data.frame(t(combn(unique(tmp$item),2)))
duplicate_items = comb %>%
filter(grepl('dospert', V1)==FALSE) %>%
filter(grepl('selection_optimization', V1)==FALSE) %>%
filter(grepl('sensation_seeking', V1)==FALSE) %>%
separate(V1, c("item1_ID", "item1_text"), sep="___") %>%
separate(V2, c("item2_ID", "item2_text"), sep="___") %>%
mutate(similarity = levenshteinSim(item1_text, item2_text)) %>%
filter(similarity>0.8) %>%
select(item1_ID, item2_ID, item1_text, item2_text)
duplicate_items
#surveys to read in
extract_items = c('worker',unique(with(duplicate_items, c(item1_ID, item2_ID))))
#correlations to compute:
#item1_t1 - item2_t1,
#item1_t2 - item2_t2,
#item1_t1 - item2_t2,
#item1_t2 - item2_t1
duplicate_items_data_t1 = read.csv(paste0(test_data_path, 'subject_x_items.csv'))
duplicate_items_data_t2 = read.csv(paste0(retest_data_path, 'subject_x_items.csv'))
duplicate_items_data_t1 = duplicate_items_data_t1 %>%
filter(worker %in% duplicate_items_data_t2$worker) %>%
select(extract_items)
duplicate_items_data_t2=duplicate_items_data_t2 %>%
filter(worker %in% duplicate_items_data_t1$worker) %>%
select(extract_items)
duplicate_items = duplicate_items %>%
mutate(t1_t1_cor = NA,
t2_t2_cor = NA,
t1_t2_cor = NA,
t2_t1_cor = NA,
t1_t1_polycor = NA,
t2_t2_polycor = NA,
t1_t2_polycor = NA,
t2_t1_polycor = NA)
for(i in 1:nrow(duplicate_items)){
duplicate_items$t1_t1_cor[i] = abs(cor(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t2_t2_cor[i] = abs(cor(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t1_t2_cor[i] = abs(cor(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t2_t1_cor[i] = abs(cor(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))
duplicate_items$t1_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t2_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t1_t2_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t1[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t2[,c(duplicate_items$item2_ID[i])]))$rho[2])
duplicate_items$t2_t1_polycor[i] = abs(polychoric(data.frame(duplicate_items_data_t2[,c(duplicate_items$item1_ID[i])],
duplicate_items_data_t1[,c(duplicate_items$item2_ID[i])]))$rho[2])
}
duplicate_items
summary(duplicate_items$t1_t1_polycor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.284 0.355 0.583 0.567 0.742 0.822
summary(duplicate_items$t2_t2_polycor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.101 0.446 0.612 0.580 0.700 0.904
summary(duplicate_items$t1_t2_polycor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0128 0.3570 0.5710 0.5160 0.6870 0.8380
summary(duplicate_items$t2_t1_polycor)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0631 0.4490 0.5350 0.5010 0.5830 0.8190
Read in and process bootstrapped results.
process_boot_df = function(df){
df = df %>%
drop_na() %>%
mutate(dv = as.character(dv),
icc = as.numeric(as.character(icc)),
spearman = as.numeric(as.character(spearman)),
pearson = as.numeric(as.character(pearson)),
eta_sq = as.numeric(as.character(eta_sq)),
sem = as.numeric(as.character(sem)),
partial_eta_sq = as.numeric(as.character(partial_eta_sq)),
omega_sq = as.numeric(as.character(omega_sq)),
var_subs = as.numeric(as.character(var_subs)),
var_ind = as.numeric(as.character(var_ind)),
var_resid = as.numeric(as.character(var_resid)),
F_time = as.numeric(as.character(F_time)),
p_time = as.numeric(as.character(p_time)),
df_time = as.numeric(as.character(df_time)),
df_resid = as.numeric(as.character(df_resid)))
return(df)}
boot_df <- read.csv(gzfile(paste0(retest_data_path,'bootstrap_merged.csv.gz')))
Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
boot_df = process_boot_df(boot_df)
boot_df = boot_df[boot_df$dv %in% retest_report_vars,]
# Check if you have all variables bootstrapped
# retest_report_vars[which(retest_report_vars %in% boot_df$dv==FALSE)]
# Boot df contains hddm parameters fit on the full sample in the t1 data
# refits_bootstrap_merged.csv.gz contains bootstrapped reliabilities
refit_boot_df = read.csv(gzfile(paste0(retest_data_path,'refits_bootstrap_merged.csv.gz')))
Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, : embedded nul(s) found in input
refit_boot_df = process_boot_df(refit_boot_df)
fullfit_boot_df = boot_df[as.character(boot_df$dv) %in% unique(as.character(refit_boot_df$dv)),]
boot_df = boot_df[!as.character(boot_df$dv) %in% unique(as.character(refit_boot_df$dv)),]
boot_df = rbind(boot_df, refit_boot_df)
rm(refit_boot_df)
Summarize bootstrapped results and merge with the lit review data
var_boot_df = boot_df %>%
group_by(dv) %>%
summarise(mean_icc = mean(icc),
mean_pearson = mean(pearson))
rel_comp = lit_review %>%
left_join(var_boot_df, by = 'dv')
Warning: Column `dv` joining factor and character vector, coercing into
character vector
Here’s what our data looks like: (583 data points for 171 measures)
rel_comp
Distribution of reliabilities, sample sizes and delays
rel_comp %>%
select(dv, task, retest_reliability, sample_size, days) %>%
filter(days < 3600) %>%
gather(key, value, -dv, -task) %>%
ggplot(aes(value, fill=task))+
geom_density(alpha=0.5, position='identity')+
facet_wrap(~key, scales='free')+
theme(legend.title = element_blank())
summary(rel_comp$sample_size[rel_comp$task == "survey"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
11 38 73 126 153 791
summary(rel_comp$sample_size[rel_comp$task == "task"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.0 30.0 47.0 78.6 81.0 1120.0
Studies in the literature that report retest reliability use smaller samples for task measures than for survey measures.
summary(lm(sample_size ~ task,rel_comp))
Call:
lm(formula = sample_size ~ task, data = rel_comp)
Residuals:
Min 1Q Median 3Q Max
-115.5 -57.6 -33.6 7.4 1043.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 126.50 8.47 14.94 < 2e-16 ***
tasktask -47.89 10.79 -4.44 0.000011 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 127 on 581 degrees of freedom
Multiple R-squared: 0.0328, Adjusted R-squared: 0.0311
F-statistic: 19.7 on 1 and 581 DF, p-value: 0.0000108
What predicts retest reliability in the literature? Candidate predictors: measure type (task vs. survey), sample size, and delay in days.
mod1 = lmer(retest_reliability ~ task + (1|dv), rel_comp)
mod2 = lmer(retest_reliability ~ task + sample_size + (1|dv), rel_comp)
anova(mod1, mod2)
refitting model(s) with ML (instead of REML)
mod3 = lmer(retest_reliability ~ task + sample_size + days+ (1|dv), rel_comp)
anova(mod2, mod3)
refitting model(s) with ML (instead of REML)
summary(mod2)
Linear mixed model fit by REML ['lmerMod']
Formula: retest_reliability ~ task + sample_size + (1 | dv)
Data: rel_comp
REML criterion at convergence: -377.2
Scaled residuals:
Min 1Q Median 3Q Max
-3.190 -0.488 0.127 0.550 2.575
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.0174 0.132
Residual 0.0206 0.143
Number of obs: 583, groups: dv, 171
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.7235198 0.0213512 33.9
tasktask -0.1405121 0.0258197 -5.4
sample_size -0.0001256 0.0000543 -2.3
Correlation of Fixed Effects:
(Intr) tsktsk
tasktask -0.780
sample_size -0.319 0.117
Tasks have significantly lower reliability and reliability decreases with increasing sample size.
rel_comp %>%
ggplot(aes(sample_size, retest_reliability, color=task))+
geom_smooth(method='lm')+
geom_point(alpha = 0.2)+
theme(legend.title = element_blank())+
xlab("Sample Size")+
ylab("Retest reliability")
I previously compared effect sizes for the effect of task on reliability estimates in the literature versus in our results (looking only at the variables found in the literature), but removed this analysis because:
1. the estimates in the literature are not all the same statistic;
2. from our data I was only using ICCs;
3. I was sampling from our bootstrapped results as many data points for each measure as there are in the literature, which did not seem like the best way to use our data to make it comparable to the literature.
Despite these problems, the main result was that the effect size was much larger in our dataset than in the literature; given the problems above, however, I think the sampling analyses below are more informative.
We also checked whether our results diverge most from studies with smaller sample sizes. The squared difference between our mean estimate and the reliability reported in the literature decreases exponentially with sample size: the smaller the sample size in the literature, the more the reliability estimate differs from our results. This was a weak result, however, because most studies in the literature have small sample sizes, and these studies show both small and large deviations (the effects were not significant either).
Correlation between our mean estimates from bootstrapped samples and the literature review for task variables
n_df = rel_comp %>%
group_by(dv) %>%
tally()
lit_emp_cor = function(){
boot_comp = data.frame()
for(i in 1:length(unique(rel_comp$dv)) ){
cur_dv = unique(rel_comp$dv)[i]
n = n_df$n[n_df$dv == cur_dv]
sample_df = boot_df %>% filter(dv == cur_dv)
tmp = sample_n(sample_df, n)
boot_comp = rbind(boot_comp, tmp)
}
rm(cur_dv, n, sample_df, tmp)
#check if cbind is ok
# sum(boot_comp$dv == rel_comp$dv)
#cbinding pearson because that is the most common metric in the lit
rel_comp = cbind(rel_comp, boot_comp$pearson)
#rename new column
names(rel_comp)[which(names(rel_comp) == "boot_comp$pearson")] = "pearson"
out = data.frame(task = NA, survey = NA)
out$task = with(rel_comp %>% filter(task == "task"), cor(pearson, retest_reliability))
out$survey = with(rel_comp %>% filter(task == "survey"), cor(pearson, retest_reliability))
rel_comp = rel_comp[,-16]
return(out)
}
lit_emp_cor_out = plyr::rdply(100, lit_emp_cor)
write.csv(lit_emp_cor_out,'/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/lit_emp_cor_out.csv')
summary(lit_emp_cor_out)
.n task survey
Min. : 1.0 Min. :0.235 Min. :0.0381
1st Qu.: 25.8 1st Qu.:0.263 1st Qu.:0.1162
Median : 50.5 Median :0.274 Median :0.1392
Mean : 50.5 Mean :0.274 Mean :0.1381
3rd Qu.: 75.2 3rd Qu.:0.285 3rd Qu.:0.1582
Max. :100.0 Max. :0.323 Max. :0.2478
Model comparisons: building models to predict the reliabilities in the literature from a sample of our results versus a sample from the literature. We sample one row per measure out of the lit review and compare:
- r^2 of retest_reliability ~ sampled_reliability vs. r^2 of retest_reliability ~ mean_pearson
- coefficient of retest_reliability ~ sampled_reliability vs. coefficient of retest_reliability ~ mean_pearson
comp_lit_pred <- function(df){
sample_from_dv <- function(df){
if(nrow(df)>1){
row_num = sample(1:nrow(df),1)
sample_row = df[row_num,]
df = df[-row_num,]
df$lit_predictor = sample_row$retest_reliability
}
return(df)
}
sampled_df = df %>%
group_by(dv) %>%
do(sample_from_dv(.))
mod_lit = lm(retest_reliability ~ lit_predictor+scale(sample_size)+task, data=sampled_df)
mod_boot = lm(retest_reliability ~ mean_pearson+scale(sample_size)+task, data=sampled_df)
out = data.frame(r2_lit = summary(mod_lit)$r.squared,
r2_boot = summary(mod_boot)$r.squared,
m_lit = coef(summary(mod_lit))["lit_predictor","Estimate"],
m_boot = coef(summary(mod_boot))["mean_pearson","Estimate"])
return(out)
}
comp_lit_pred_out = plyr::rdply(1000, comp_lit_pred(rel_comp))
tmp = comp_lit_pred_out %>%
select(-.n) %>%
gather(key, value) %>%
separate(key, c("stat", "sample"), sep = "_")
tmp$stat = as.factor(tmp$stat)
levels(tmp$stat) <- c("coefficient", expression(r^"2"))
tmp %>%
ggplot(aes(value, fill=sample))+
geom_density(alpha = 0.5, position='identity', color=NA)+
facet_grid(.~stat, scales='free', labeller = label_parsed)+
scale_fill_discrete(breaks=c("boot","lit"),
labels=c("Empirical", "Literature"),
name="Prediction")+
xlab('')+
ylab('')
ggsave('Lit_Noise_Ceiling.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 10, height = 4, units = "in", limitsize = FALSE, dpi = 100)
with(comp_lit_pred_out, t.test(r2_lit, r2_boot, paired=T))
Paired t-test
data: r2_lit and r2_boot
t = 29, df = 1000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.02467 0.02821
sample estimates:
mean of the differences
0.02644
with(comp_lit_pred_out, t.test(m_lit, m_boot))
Welch Two Sample t-test
data: m_lit and m_boot
t = 18, df = 1300, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.02791 0.03472
sample estimates:
mean of x mean of y
0.3334 0.3020
Based on Weir (2005), ICC(3,k) does not take into account within-subject differences between the two time points (i.e., the fixed effect of time, or systematic error). Thus, it is well approximated by Pearson’s r and subject to similar criticisms. Weir (2005) suggests reporting at least this systematic error effect size if one chooses to report ICC(3,k). Based on his conclusions, here I report:
- ICC(3,k): As Dave clarified, this ranges from -1/(number of repeated measures - 1) to 1, so in our case the range is [-1, 1]. Larger values mean that a subject’s two scores for a given measure are more similar to each other than they are to other people’s scores; “ICC is reflective of the ability of a test to differentiate between different individuals.”
- Partial \(\eta^2\) for time (\(SS_{time}/SS_{within}\)): effect size of time, i.e., the systematic error.
- SEM (\(\sqrt{MS_{error}}\)): standard error of measurement; the smaller the better. It “quantifies the precision of individual scores on a test” and is not sample-dependent the way the ICC is, since (at least in this formulation) it does not depend on between-subjects variability; it is, however, unit-dependent.
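A minimal R sketch of these three quantities, computed from the mean squares of a two-way repeated-measures decomposition. The data here are toy values (not the study’s variables), used only to show how the formulas above fit together:

```r
# Toy two-session data (hypothetical); t1, t2 = two measurement occasions
set.seed(1)
n <- 50; k <- 2
t1 <- rnorm(n)
t2 <- 0.7 * t1 + rnorm(n, sd = 0.5)
scores <- cbind(t1, t2)

grand    <- mean(scores)
SS_subj  <- k * sum((rowMeans(scores) - grand)^2)  # between-subjects SS
SS_time  <- n * sum((colMeans(scores) - grand)^2)  # systematic session effect
SS_total <- sum((scores - grand)^2)
SS_error <- SS_total - SS_subj - SS_time

MS_subj  <- SS_subj  / (n - 1)
MS_error <- SS_error / ((n - 1) * (k - 1))

icc3k <- (MS_subj - MS_error) / MS_subj    # ICC(3,k)
peta  <- SS_time / (SS_time + SS_error)    # partial eta^2 for time
sem   <- sqrt(MS_error)                    # standard error of measurement
```

With the strong simulated t1–t2 relationship, icc3k comes out high, peta stays small, and sem is in the units of the measure.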
We calculated 74 measures for surveys and 372 measures for cognitive tasks.
Though we primarily report ICCs as our metric of reliability, the results do not change depending on the metric chosen. Here we plot point estimates of three reliability metrics against each other (ICC, Pearson, Spearman). The bootstrapped version is essentially the same, but the plots are busier due to more data points.
table(rel_df$task)
survey task
74 372
p1 = rel_df %>%
ggplot(aes(spearman, icc, col=task))+
geom_point()+
theme_bw()+
theme(legend.title = element_blank(),
legend.position = "none")+
geom_abline(intercept = 0, slope=1)
p2 = rel_df %>%
ggplot(aes(pearson, icc, col=task))+
geom_point()+
theme_bw()+
theme(legend.title = element_blank(),
legend.position = "none")+
geom_abline(intercept = 0, slope=1)
p3 = rel_df %>%
ggplot(aes(pearson, spearman, col=task))+
geom_point()+
theme_bw()+
theme(legend.title = element_blank(),
legend.position = "none")+
geom_abline(intercept = 0, slope=1)
grid.arrange(p1, p2, p3, nrow=1)
ggsave('Metric_Scatterplots.jpg', plot = grid.arrange(p1, p2, p3, nrow=1), device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 12, height = 4, units = "in", limitsize = FALSE, dpi = 100)
As the scatter plots depict, the correlations between the different types of reliability metrics were very high.
cor(rel_df[,c('spearman', 'icc', 'pearson')])
spearman icc pearson
spearman 1.0000 0.9324 0.9577
icc 0.9324 1.0000 0.9800
pearson 0.9577 0.9800 1.0000
Note: Some variables have ICCs < 0. This occurs when \(MS_{error}\) > \(MS_{between}\); for these variables the data show no relationship between the two time points.
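To see why negative values arise, a toy calculation with made-up mean squares (hypothetical numbers, not from the study):

```r
# Made-up mean squares where error variance exceeds between-subjects variance
MS_subj  <- 0.9
MS_error <- 1.2
icc3k <- (MS_subj - MS_error) / MS_subj
icc3k  # negative: subjects are not ordered consistently across sessions
```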
Summarized bootstrapped reliabilities
boot_df %>%
group_by(dv) %>%
summarise(icc_median = quantile(icc, probs = 0.5),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
spearman_median = quantile(spearman, probs = 0.5),
spearman_2.5 = quantile(spearman, probs = 0.025),
spearman_97.5 = quantile(spearman, probs = 0.975)) %>%
datatable() %>%
formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
# Df wrangling for plotting
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv')
tmp = tmp %>%
separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
separate(var, c("var"), sep="\\.",remove=TRUE,extra="drop") %>%
mutate(task_group = gsub("_", " ", task_group),
var = gsub("_", " ", var)) %>%
arrange(task_group, var)
tmp = tmp %>%
left_join(rel_df[,c("dv", "icc")], by = "dv") %>%
rename(icc = icc.x, point_est = icc.y)
#Manual correction
tmp = tmp %>%
mutate(task = ifelse(task_group == 'holt laury survey', "task", as.character(task))) %>%
mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
mutate(task_group = gsub("survey", "", task_group))
Variable level summary of bootstrapped reliabilities.
p4 <- tmp %>%
filter(task == 'task',
raw_fit == 'raw') %>%
ggplot(aes(y = var, x = icc)) +
geom_point(color = '#00BFC4')+
geom_point(aes(y = var, x = point_est), color = "black")+
facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y", labeller = label_wrap_gen(width=20)) +
theme(panel.spacing = unit(0.75, "lines"),
strip.placement = "outside",
strip.text.y = element_text(angle=180, size=36),
axis.text.y = element_text(size=20),
panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = "grey80")) +
xlab("")+
ylab("")+
scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
geom_vline(xintercept = 0, color = "red", size = 1)
p5 <- tmp %>%
filter(task == 'survey') %>%
ggplot(aes(y = var, x = icc)) +
geom_point(color = '#F8766D')+
geom_point(aes(y = var, x = point_est), color = "black")+
facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y", labeller = label_wrap_gen(width=20)) +
theme(panel.spacing = unit(0.75, "lines"),
strip.placement = "outside",
strip.text.y = element_text(angle=180, size=36),
axis.text.y = element_text(size=20),
panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = "grey80")) +
xlab("")+
ylab("")+
scale_x_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 10))+
geom_vline(xintercept = 0, color = "red", size = 1)
p6 <- arrangeGrob(p4, p5,nrow=1)
ggsave('Bootstrap_Raw_Var_Plot.jpg', plot = p6, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 36, height = 72, units = "in", limitsize = FALSE, dpi=50)
rm(p4, p5, p6)
p4_t <- tmp %>%
filter(task == 'task') %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) +
geom_violin(fill='#00BFC4')+
theme_bw() +
theme(axis.text.y = element_text(size=30))+
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
geom_hline(yintercept = 0, color = "red", size = 1)+
coord_flip()
p5_t <- tmp %>%
filter(task == 'survey') %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) +
geom_violin(fill='#F8766D')+
theme_bw() +
theme(axis.text.y = element_text(size=43))+
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
geom_hline(yintercept = 0, color = "red", size = 1)+
coord_flip()
p6_t <- arrangeGrob(p4_t, p5_t,nrow=1)
ggsave('Bootstrap_Poster_Plot_t.jpg', plot = p6_t, device = "jpeg", path = "../output/figures/", width = 27, height = 48, units = "in", limitsize = FALSE, dpi = 100)
Example of the overlaying procedure.
p1<- tmp %>%
filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = var, y = icc)) +
geom_violin(fill='#F8766D')+
theme_bw() +
theme(axis.text.y = element_text(size=30))+
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
geom_hline(yintercept = 0, color = "red", size = 1)+
coord_flip()
p2<- tmp %>%
filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc, group=dv)) +
geom_violin(fill='#F8766D', position = 'identity')+
theme_bw() +
theme(axis.text.y = element_text(size=30))+
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
geom_hline(yintercept = 0, color = "red", size = 1)+
coord_flip()
p3<- tmp %>%
filter(grepl('selection_optimization', dv)) %>%
ggplot(aes(x = factor(task_group, levels=rev(unique(task_group))), y = icc)) +
geom_violin(fill='#F8766D', position = 'identity')+
theme_bw() +
theme(axis.text.y = element_text(size=30))+
xlab("")+
ylab("")+
scale_y_continuous(limits = c(-0.25,1), breaks=c(-0.25, 0, 0.25, 0.5, 0.75, 1))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
geom_hline(yintercept = 0, color = "red", size = 1)+
coord_flip()
p4 <- arrangeGrob(p1,p2, p3,nrow=1)
ggsave('Bootstrap_Example_Plot_t.jpg', plot = p4, device = "jpeg", path = "../output/figures/", width = 41, height = 5, units = "in", limitsize = FALSE, dpi = 100)
rm(p1, p2, p3, p4)
Comparison of survey measures to cognitive task measures in the bootstrapped results: a multilevel model with random intercepts for each measure and a fixed effect of measure type (survey versus cognitive task).
boot_df = boot_df %>%
mutate(task = ifelse(grepl("survey",dv), "survey","task"),
task = ifelse(grepl("holt",dv), "task", task))
boot_df %>%
group_by(task) %>%
summarise(icc_median = quantile(icc, probs = 0.5),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
spearman_median = quantile(spearman, probs = 0.5),
spearman_2.5 = quantile(spearman, probs = 0.025),
spearman_97.5 = quantile(spearman, probs = 0.975),
num_vars = n()/1000) %>%
datatable() %>%
formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
summary(lmerTest::lmer(icc ~ task + (1|dv), boot_df))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ task + (1 | dv)
Data: boot_df
REML criterion at convergence: -942508
Scaled residuals:
Min 1Q Median 3Q Max
-13.201 -0.401 0.024 0.459 7.051
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.05354 0.2314
Residual 0.00701 0.0837
Number of obs: 446000, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.8725 0.0269 449.0000 32.4 <2e-16 ***
tasktask -0.3903 0.0295 449.0000 -13.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
tasktask -0.913
The quantitative explanation for the difference in reliability estimates between surveys and tasks, as recently detailed by Hedge et al. (2017), lies in the different sources of variance underlying these measures. Specifically, the ICC is calculated as the ratio of between-subjects variance to the total variance from all sources. Thus, measures with high between-subjects variance have high test-retest reliability. Intuitively, measures with high between-subjects variance are also better suited for individual difference analyses, as they capture the differences between the subjects in a sample.
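The ratio just described, written out with hypothetical variance components (made-up values, not fit to the study data):

```r
# Hypothetical variance components for one measure
var_subs  <- 0.80  # between subjects
var_ind   <- 0.05  # between sessions (systematic)
var_resid <- 0.15  # residual error
icc <- var_subs / (var_subs + var_ind + var_resid)
icc  # 0.8: most variance separates subjects, so the measure ranks people stably
```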
Here we first plot the percentage of variance explained by the three sources of variance for the point estimates of measure reliabilities. The plot only includes raw measures (no DDM parameters) and the measures are ranked by percentage of between subject variability for each task/survey (i.e. the best to worst individual difference measure for each task/survey). Then we compare statistically whether the percentage of variance explained by these sources differ between tasks and surveys.
tmp = rel_df %>%
mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100,
var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct) %>%
mutate(dv = factor(dv, levels = dv[order(task)])) %>%
separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
arrange(task_group, var_subs_pct) %>%
mutate(rank = row_number()) %>%
arrange(task, task_group, rank) %>%
gather(key, value, -dv, -task_group, -var, -task, -rank) %>%
ungroup()%>%
mutate(task_group = gsub("_", " ", task_group),
var = gsub("_", " ", var)) %>%
mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
mutate(task_group = gsub("survey", "", task_group)) %>%
filter(task=="task",
!grepl("EZ|hddm", dv))%>%
arrange(task_group, rank)
labels = tmp %>%
distinct(dv, .keep_all=T)
p1 <- tmp %>%
ggplot(aes(x=factor(rank), y=value, fill=factor(key, levels = c("var_resid_pct", "var_ind_pct", "var_subs_pct"))))+
geom_bar(stat='identity', alpha = 0.75, color='#00BFC4')+
scale_x_discrete(breaks = labels$rank,
labels = labels$var)+
coord_flip()+
facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
theme(panel.spacing = unit(0.5, "lines"),
strip.placement = "outside",
strip.text.y = element_text(angle=180),
panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = "grey85"),
legend.position = 'bottom')+
theme(legend.title = element_blank())+
scale_fill_manual(breaks = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
labels = c("Variance between individuals",
"Variance between sessions",
"Error variance"),
values=c("grey65", "grey45", "grey25"))+
ylab("")+
xlab("")
tmp = rel_df %>%
mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100,
var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct) %>%
mutate(dv = factor(dv, levels = dv[order(task)])) %>%
separate(dv, c("task_group", "var"), sep="\\.",remove=FALSE,extra="merge") %>%
mutate(task_group = factor(task_group, levels = task_group[order(task)])) %>%
arrange(task_group, var_subs_pct) %>%
mutate(rank = row_number()) %>%
arrange(task, task_group, rank) %>%
gather(key, value, -dv, -task_group, -var, -task, -rank) %>%
ungroup()%>%
mutate(task_group = gsub("_", " ", task_group),
var = gsub("_", " ", var)) %>%
mutate(task_group = ifelse(task_group == "psychological refractory period two choices", "psychological refractory period", ifelse(task_group == "angling risk task always sunny", "angling risk task",task_group))) %>%
mutate(task_group = gsub("survey", "", task_group)) %>%
filter(task=="survey")%>%
arrange(task_group, rank)
labels = tmp %>%
distinct(dv, .keep_all=T)
p2 <- tmp %>%
ggplot(aes(x=factor(rank), y=value, fill=factor(key, levels = c("var_resid_pct", "var_ind_pct", "var_subs_pct"))))+
geom_bar(stat='identity', alpha = 0.75)+
geom_bar(stat='identity', color='#F8766D', show.legend=FALSE)+
scale_x_discrete(breaks = labels$rank,
labels = labels$var)+
coord_flip()+
facet_grid(task_group~., switch = "y", scales = "free_y", space = "free_y") +
theme(panel.spacing = unit(0.5, "lines"),
strip.placement = "outside",
strip.text.y = element_text(angle=180),
panel.background = element_rect(fill = NA),
panel.grid.major = element_line(colour = "grey85"),
legend.position = 'bottom')+
theme(legend.title = element_blank())+
scale_fill_manual(breaks = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
labels = c("Variance between individuals",
"Variance between sessions",
"Error variance"),
values=c("grey65", "grey45", "grey25"))+
ylab("")+
xlab("")
mylegend<-g_legend(p2)
p3 <- arrangeGrob(arrangeGrob(p1 +theme(legend.position="none"),
p2 + theme(legend.position="none"),
nrow=1),
mylegend, nrow=2,heights=c(10, 1))
ggsave('Variance_Breakdown_Plot.jpg', plot = p3, device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 24, height = 20, units = "in", dpi = 100)
rm(tmp, labels, p1, p2 , p3)
Comparing types of variance for survey vs. task measures: survey measures have higher between-subjects variability
Note: This analysis includes DDM variables too.
We ran separate models for the different sources of variance because an interactive model with a variance type × task term seemed too complicated.
First we find that task measures have a smaller percentage of their overall variance explained by variability between subjects compared to survey measures.
tmp = boot_df %>%
mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid)*100,
var_ind_pct = var_ind/(var_subs+var_ind+var_resid)*100,
var_resid_pct = var_resid/(var_subs+var_ind+var_resid)*100) %>%
select(dv, task, var_subs_pct, var_ind_pct, var_resid_pct)
summary(lmerTest::lmer(var_subs_pct~task+(1|dv),tmp%>%select(-var_ind_pct,-var_resid_pct)))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: var_subs_pct ~ task + (1 | dv)
Data: tmp %>% select(-var_ind_pct, -var_resid_pct)
REML criterion at convergence: 3304624
Scaled residuals:
Min 1Q Median 3Q Max
-4.316 -0.621 0.103 0.647 5.381
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 188 13.7
Residual 96 9.8
Number of obs: 446000, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 80.33 1.60 443.00 50.4 <2e-16 ***
tasktask -28.81 1.75 443.00 -16.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
tasktask -0.913
We also find that a significantly larger percentage of task measures’ variance is explained by between-session variability. Larger between-session variability suggests systematic differences between the two sessions; such systematic effects can be due to, e.g., learning effects, as explored later.
summary(lmerTest::lmer(var_ind_pct~task+(1|dv),tmp%>%select(-var_subs_pct,-var_resid_pct)))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: var_ind_pct ~ task + (1 | dv)
Data: tmp %>% select(-var_subs_pct, -var_resid_pct)
REML criterion at convergence: 3611691
Scaled residuals:
Min 1Q Median 3Q Max
-4.412 -0.651 -0.139 0.584 4.736
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 239 15.5
Residual 191 13.8
Number of obs: 446000, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 9.84 1.80 444.00 5.47 7.4e-08 ***
tasktask 14.36 1.97 444.00 7.29 1.4e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
tasktask -0.913
summary(lmerTest::lmer(var_resid_pct~task+(1|dv),tmp%>%select(-var_subs_pct,-var_ind_pct)))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: var_resid_pct ~ task + (1 | dv)
Data: tmp %>% select(-var_subs_pct, -var_ind_pct)
REML criterion at convergence: 2777592
Scaled residuals:
Min 1Q Median 3Q Max
-6.323 -0.475 0.018 0.530 6.343
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 75.2 8.67
Residual 29.4 5.43
Number of obs: 446000, groups: dv, 446
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 9.83 1.01 446.00 9.75 <2e-16 ***
tasktask 14.45 1.10 446.00 13.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
tasktask -0.913
tmp_save = tmp%>%
gather(key, value, -dv, -task) %>%
group_by(task, key) %>%
summarise(median = median(value),
sd = sd(value)) %>%
mutate(key = ifelse(key == 'var_ind_pct', 'Between session variance', ifelse(key == 'var_subs_pct', 'Between subjects variance', ifelse(key == 'var_resid_pct', 'Residual variance',NA)))) %>%
rename(Median = median, SD = sd) %>%
arrange(task, key)
tmp_save
sjt.df(tmp_save %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/var_breakdown_summary.doc")
| task | key | Median | SD |
|---|---|---|---|
| survey | Between session variance | 5.384 | 11.353 |
| survey | Between subjects variance | 83.355 | 11.686 |
| survey | Residual variance | 8.976 | 4.325 |
| task | Between session variance | 18.077 | 22.114 |
| task | Between subjects variance | 52.548 | 17.681 |
| task | Residual variance | 22.818 | 11.012 |
Summarizing for clearer presentation. This graph uses the bootstrapped reliabilities and is therefore messier than it would be with the point estimates alone.
tmp %>%
gather(key, value, -dv, -task) %>%
group_by(task, key) %>%
summarise(mean_pct = mean(value),
sd_pct = sd(value),
n = n()) %>%
mutate(cvl = qt(0.025, n-1),
cvu = qt(0.975, n-1),
cil = mean_pct+(sd_pct*cvl)/sqrt(n),
ciu = mean_pct+(sd_pct*cvu)/sqrt(n),
sem_pct = sd_pct/sqrt(n)) %>%
ggplot(aes(factor(key, levels = c("var_subs_pct", "var_ind_pct", "var_resid_pct"),
labels = c("Variance between individuals",
"Variance between sessions",
"Error variance")), mean_pct, color=task))+
geom_point(position=position_dodge(width = 0.25), size = 4)+
# geom_errorbar(aes(ymin=mean_pct-sd_pct, ymax=mean_pct+sd_pct), position=position_dodge(width = 0.25), width=0, size=2)+
geom_errorbar(aes(ymin=cil, ymax=ciu), position=position_dodge(width = 0.25), width=0.25, size=2)+
theme_bw()+
xlab('')+
ylab('Percent')+
theme(legend.title = element_blank(),
legend.text = element_text(size=14),
legend.position = 'bottom',
axis.text = element_text(size=12),
axis.title.y = element_text(size=12))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))
ggsave('Variance_Breakdown_DotPlot.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 6, height = 5, units = "in", dpi = 100)
The type of ICC we have chosen does not take within-subject (between-session/systematic) variance into account. This is why Weir recommends checking whether there is a significant effect of time and examining the SEMs. These systematic effects could be meaningful and important to account for in some measures (e.g., task measures that show learning effects).
Had we chosen another kind of ICC taking this source of variance into account (e.g. 2,1 or 2,k) they could have suggested that tasks have lower reliability.
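To make the ICC distinction concrete: both variants come from the same two-way ANOVA mean squares, and the only difference is whether between-session variance counts as error. A minimal numerical sketch (in Python with simulated scores, not our analysis code) shows how a constant practice effect lowers ICC(2,1) while barely affecting ICC(3,1):

```python
import numpy as np

def icc_from_matrix(scores):
    """ICC(3,1) and ICC(2,1) for an n_subjects x k_sessions matrix.

    Both use the same two-way ANOVA mean squares (Shrout & Fleiss; Weir, 2005):
    ICC(3,1) treats session effects as fixed (consistency), while
    ICC(2,1) counts them as error (absolute agreement).
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-session means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # sessions
    resid = scores - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual
    icc_3_1 = (msr - mse) / (msr + (k - 1) * mse)
    icc_2_1 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return icc_3_1, icc_2_1

# Simulated retest data with a constant +5 practice effect at time 2
rng = np.random.default_rng(0)
t1 = rng.normal(50, 10, 200)
t2 = t1 + rng.normal(0, 4, 200) + 5
icc31, icc21 = icc_from_matrix(np.column_stack([t1, t2]))
# icc31 stays high; icc21 is pulled down by the systematic session shift
```

This is exactly Weir's point: when a systematic session effect (e.g. learning) exists, the two variants diverge, and the consistency ICC alone will not reveal it.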
A simple t-test on the difference score alone would not be a very rigorous way of testing whether any change is meaningful, because two error-laden distributions from the two time points would be compared to each other. Fortunately there are ways to take the error of both measurements into account.
To check whether a measure shows systematic differences between the two time points in a meaningful number of bootstrapped samples we can check whether the effect of time is significant in each bootstrapped sample and flag variables for which more than 5% of the bootstrapped samples show significant time effects.
Another way might be to compute confidence intervals using SEMs, as described in the second half of Weir (2005), and check what percentage of participants have scores that fall outside this range. I haven’t pursued this for now.
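For reference, the SEM-based approach rests on two quantities (Weir, 2005): SEM = SD * sqrt(1 - ICC), and the minimal detectable change MDC95 = 1.96 * sqrt(2) * SEM, where the sqrt(2) reflects that a change score carries error from both measurements. A minimal sketch with made-up numbers (Python, for illustration only):

```python
import math

def sem_and_mdc(sd_pooled, icc):
    """Standard error of measurement and 95% minimal detectable change.

    sem: expected spread of observed scores around the true score.
    mdc95: smallest t1-to-t2 change exceeding measurement noise at 95%
    confidence; sqrt(2) because two error-laden scores are compared.
    """
    sem = sd_pooled * math.sqrt(1 - icc)
    mdc95 = 1.96 * math.sqrt(2) * sem
    return sem, mdc95

# Hypothetical measure: pooled SD of 10, ICC of 0.75
sem, mdc95 = sem_and_mdc(sd_pooled=10.0, icc=0.75)  # sem = 5.0, mdc95 ~ 13.9
# A participant moving from 50 to 62 changes by less than mdc95,
# so the change is not distinguishable from measurement error
meaningful = abs(62.0 - 50.0) > mdc95
```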
Here we ask: Which variables have significant time effects in more than 5% of the bootstrapped samples?
23/74 survey measures; 133/372 task measures.
boot_df %>%
select(dv, p_time, task) %>%
mutate(time_effect_sig = ifelse(p_time<0.05,1,0)) %>%
group_by(dv)%>%
summarise(pct_sig_time_effect = sum(time_effect_sig)/10,
task = unique(task))%>%
filter(pct_sig_time_effect>5) %>%
arrange(task,-pct_sig_time_effect) %>%
ungroup()%>%
group_by(task) %>%
summarise(count=n())
Are these significantly different between surveys and tasks? No.
chisq.test(matrix(data = c(23,74-23,133, 372-133), nrow=2))
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(data = c(23, 74 - 23, 133, 372 - 133), nrow = 2)
X-squared = 0.4, df = 1, p-value = 0.5
Are they predominantly in one or the other direction and does that differ depending on survey vs task?
boot_df %>%
select(dv, p_time, task, pearson) %>%
mutate(time_effect_sig = ifelse(p_time<0.05,1,0)) %>%
group_by(dv)%>%
summarise(pct_sig_time_effect = sum(time_effect_sig)/10,
task = unique(task),
mean_pearson = mean(pearson))%>%
filter(pct_sig_time_effect>5) %>%
arrange(task,-pct_sig_time_effect) %>%
arrange(mean_pearson)
Step 1: T = Grand mean + ICC * (Subject score - Grand mean)
Step 2: SEP = SD of both measurements * sqrt(1 - ICC^2)
Here I calculate the proportion of subjects that move more than one standard error of prediction (SEP) in t2 compared to t1 for each measure.
This is odd to think about because the larger the ICC of a measure the smaller the SEP. So very small differences between the two time points can be categorized as ‘meaningful’ based on the tiny SEP.
What you are interested in is not necessarily whether individuals change at all between the two time points though. You want to know if this change is systematic in one direction.
Here I calculate the proportion of people showing ‘meaningful’ changes in one direction or the other. To integrate the two directions I subtracted one proportion from the other and filtered the variables for which more than 5% of the participants show meaningful change in one direction over the other (so a variable with 10 participants changing in one direction and 10 in the other cancels out, but a variable with 15 people showing a positive change and 5 a negative change remains).
get_ind_ci = function(dv_var){
matched = match_t1_t2(dv_var)
grand_mean = mean(matched$score)
grand_sd = sd(matched$score)
dv_icc = get_icc(dv_var)
sep = grand_sd * sqrt(1-(dv_icc^2))
matched = matched %>%
spread(time, score) %>%
rename("t1"="1", "t2"="2") %>%
mutate(true_score = grand_mean+(dv_icc*(t1-grand_mean)),
ci_up = true_score+(1.96*sep),
ci_low = true_score-(1.96*sep),
t2_above_ci = ifelse(t2>ci_up,1,0),
t2_below_ci = ifelse(t2<ci_low,1, 0))
return(matched)
}
get_prop_out_ci = function(dv_var){
get_ind_ci(dv_var) %>%
summarise(prop_above_ci = sum(t2_above_ci)/n(),
prop_below_ci = sum(t2_below_ci)/n())
}
#Create df of point estimate reliabilities
ind_ci_df <- data.frame(prop_above_ci = rep(NA, length(numeric_cols)),
prop_below_ci = rep(NA, length(numeric_cols)))
row.names(ind_ci_df) <- numeric_cols
for(i in 1:length(numeric_cols)){
ind_ci_df[numeric_cols[i], 'prop_above_ci'] <- get_prop_out_ci(numeric_cols[i])$prop_above_ci
ind_ci_df[numeric_cols[i], 'prop_below_ci'] <- get_prop_out_ci(numeric_cols[i])$prop_below_ci
}
ind_ci_df$dv = row.names(ind_ci_df)
row.names(ind_ci_df) = seq(1:nrow(ind_ci_df))
ind_ci_df$task = 'task'
ind_ci_df[grep('survey', ind_ci_df$dv), 'task'] = 'survey'
ind_ci_df[grep('holt', ind_ci_df$dv), 'task'] = "task"
ind_ci_df = ind_ci_df %>%
select(dv, task, prop_above_ci, prop_below_ci)
mean_diff_df = ind_ci_df %>%
mutate(prop_one_direction = prop_above_ci-prop_below_ci) %>%
filter(abs(prop_one_direction)>0.05) %>%
arrange(-abs(prop_one_direction))
mean_diff_df
# write.csv(mean_diff_df, '/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/mean_diff_df.csv')
This doesn’t make it easier to interpret whether a change is a performance improvement or decline. In fact ‘performance’ isn’t even necessarily the correct term here. Using the valence assignments of the measures we can instead look at whether a change translates to an increase or decrease in self-regulation. Of the 66 variables this makes sense for 54.
mean_diff_df = read.csv('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/mean_diff_df.csv')
table(as.character(mean_diff_df$valence), useNA = "ifany")
Neg Pos <NA>
24 30 12
mean_diff_df %>%
mutate(sc_increase = ifelse(prop_one_direction>0&valence=="Pos",1,ifelse(prop_one_direction<0&valence=="Neg",1,0)),
sc_decrease = ifelse(prop_one_direction>0&valence=="Neg",1,ifelse(prop_one_direction<0&valence=="Pos",1,0))) %>%
summarise(sum_sc_increase = sum(sc_increase,na.rm=T),
sum_sc_decrease = sum(sc_decrease,na.rm=T))
Whether variables show more than 5% difference in one direction does not depend on whether the change translates to a self-control increase or decrease.
chisq.test(matrix(data = c(23,54-23,31,54-31), nrow=2))
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(data = c(23, 54 - 23, 31, 54 - 31), nrow = 2)
X-squared = 1.8, df = 1, p-value = 0.2
Do they differ depending on whether they are task or survey variables? No.
chisq.test(matrix(data = c(5,74-5,61,372-61), nrow=2))
Pearson's Chi-squared test with Yates' continuity correction
data: matrix(data = c(5, 74 - 5, 61, 372 - 61), nrow = 2)
X-squared = 3.8, df = 1, p-value = 0.05
rm(mean_diff_df)
Here we summarize the results at the task level to make them more digestible and easier to connect with the literature.
We reduce the list of task measures to one per task by averaging only the raw measures computed over all the trials in a task. We chose to reduce the information in this manner to avoid any bias stemming from the differential amount of interest in and procedures applied to certain tasks over others (e.g. a task can have over 10 measures because it has multiple conditions and we have chosen to fit DDMs for specific conditions, while another might only have 2 due to our relative inexperience with and lack of interest in it). We also check whether the number of trials in a task has a significant effect on these average reliabilities of raw measures.
We filter out the DDM parameters and measures for specific contrasts. Note that this still leaves some tasks with measures that are model fits and/or for specific conditions (because the current datasets do not include measures that are both raw and based on all the trials, though I could dive into variables_exhaustive for such measures; for example the average reliability for Kirby is based on three discount rates for specific conditions). Here’s the order of tasks by mean reliability, sorted by ICC and then Spearman’s \(\rho\).
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'task') %>%
left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv') %>%
filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2) %>%
group_by(task_name) %>%
summarise(median_icc = median(icc),
median_spearman = median(spearman),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
min_spearman = min(spearman),
max_spearman = max(spearman),
num_measures = n()/1000,
mean_num_trials = round(mean(num_all_trials)))%>%
arrange(-median_icc, -median_spearman)
tmp %>%
datatable() %>%
formatRound(columns=c('median_spearman', 'median_icc',
'min_spearman', 'icc_2.5',
'max_spearman', 'icc_97.5'), digits=3)
tmp = tmp%>%
select(-median_spearman, -min_spearman, -max_spearman) %>%
mutate(task_name = gsub("_", " ", task_name),
task_name = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", task_name, perl=TRUE))
names(tmp) = gsub("_", " ", names(tmp))
names(tmp) = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", names(tmp), perl=TRUE)
sjt.df(tmp %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/task_rel_table.doc")
| Task Name | Median Icc | Icc 2.5 | Icc 97.5 | Num Measures | Mean Num Trials |
|---|---|---|---|---|---|
| Ravens | 0.875 | 0.845 | 0.904 | 1 | 18 |
| Adaptive N Back | 0.842 | 0.362 | 0.911 | 3 | 380 |
| Kirby | 0.835 | 0.746 | 0.897 | 12 | 14 |
| Cognitive Reflection Survey | 0.783 | 0.669 | 0.865 | 2 | 6 |
| Threebytwo | 0.772 | 0.694 | 0.845 | 2 | 383 |
| Recent Probes | 0.772 | 0.629 | 0.847 | 2 | 68 |
| Stroop | 0.76 | 0.445 | 0.844 | 7 | 67 |
| Simple Reaction Time | 0.738 | 0.668 | 0.826 | 1 | 149 |
| Psychological Refractory Period Two Choices | 0.732 | 0.499 | 0.85 | 4 | 186 |
| Choice Reaction Time | 0.727 | 0.506 | 0.848 | 2 | 150 |
| Hierarchical Rule | 0.707 | 0.592 | 0.792 | 3 | 298 |
| Attention Network Task | 0.703 | -0.513 | 0.814 | 6 | 75 |
| Simon | 0.692 | 0.398 | 0.821 | 8 | 61 |
| Directed Forgetting | 0.681 | 0.479 | 0.782 | 2 | 67 |
| Digit Span | 0.675 | 0.544 | 0.786 | 2 | 10 |
| Shape Matching | 0.673 | 0.57 | 0.763 | 2 | 272 |
| Information Sampling Task | 0.67 | 0.118 | 0.859 | 8 | 10 |
| Stop Signal | 0.661 | 0.261 | 0.815 | 11 | 304 |
| Spatial Span | 0.658 | 0.382 | 0.794 | 2 | 10 |
| Discount Titrate | 0.645 | 0.496 | 0.753 | 1 | 36 |
| Go Nogo | 0.643 | 0.256 | 0.785 | 5 | 210 |
| Keep Track | 0.641 | 0.497 | 0.719 | 1 | 9 |
| Tower Of London | 0.634 | 0.436 | 0.855 | 4 | 12 |
| Dot Pattern Expectancy | 0.599 | 0.111 | 0.798 | 11 | 61 |
| Columbia Card Task Hot | 0.597 | 0.433 | 0.857 | 5 | 24 |
| Stim Selective Stop Signal | 0.577 | 0.375 | 0.752 | 4 | 104 |
| Two Stage Decision | 0.575 | 0.294 | 0.746 | 4 | 198 |
| Local Global Letter | 0.57 | 0.285 | 0.76 | 12 | 31 |
| Dietary Decision | 0.564 | 0.088 | 0.749 | 3 | 48 |
| Columbia Card Task Cold | 0.524 | 0.248 | 0.688 | 5 | 24 |
| Holt Laury Survey | 0.524 | -0.122 | 0.659 | 3 | 10 |
| Shift Task | 0.476 | -0.056 | 0.771 | 13 | 401 |
| Angling Risk Task Always Sunny | 0.428 | 0.039 | 0.636 | 6 | 21 |
| Motor Selective Stop Signal | 0.423 | -0.357 | 0.798 | 4 | 63 |
| Probabilistic Selection | 0.402 | -0.194 | 0.752 | 7 | 54 |
| Writing Task | 0.4 | 0.207 | 0.538 | 2 | 1 |
| Bickel Titrator | 0.066 | -0.033 | 0.783 | 4 | 30 |
Does the number of trials in a task have a significant effect on the average ICC of the (mostly) raw measures computed over all trials from a task? No. (No effect on Spearman either.)
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'task') %>%
# left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2)
# summary(lm(icc ~ num_all_trials, data = tmp))
summary(lmerTest::lmer(icc ~ num_all_trials + (1|dv), data = tmp))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ num_all_trials + (1 | dv)
Data: tmp
REML criterion at convergence: -408213
Scaled residuals:
Min 1Q Median 3Q Max
-12.548 -0.460 0.037 0.516 7.553
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.04556 0.2135
Residual 0.00555 0.0745
Number of obs: 174000, groups: dv, 174
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 5.81e-01 2.11e-02 1.72e+02 27.60 <2e-16 ***
num_all_trials 2.12e-05 1.19e-04 1.72e+02 0.18 0.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
num_ll_trls -0.640
measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'task') %>%
# left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
filter(overall_difference != 'difference' & raw_fit %in% c('EZ', 'hddm') == FALSE)%>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2) %>%
ggplot(aes(num_all_trials, icc))+
geom_point()+
geom_smooth(method="lm")+
theme_bw()+
xlab("Number of trials")+
ylab("ICC")
The above analysis looked at the effect of number of trials across tasks. But some tasks might be bad for individual difference measurement regardless of how many trials they contain, whereas for others even fewer trials might yield a sufficiently reliable measure.
For tasks whose dependent variables are estimated using many trials one can ask: Does the same measure get less reliable if fewer trials are used to estimate it?
This won’t make sense for all tasks. For example, to estimate a risk aversion parameter you need all of the Holt and Laury trials. For Kirby and Bickel specific conditions already look at fewer trials. The Cognitive Reflection Task might be more appropriate to analyze item by item. The writing task does not have trial numbers. For all others it might be interesting to investigate.
These kinds of analyses are too task-specific and in-depth for a paper that is trying to give a global sense of the differences between self-regulation measures in their suitability for individual difference analyses based on their stability across time. Such analyses would show how to extract the most reliable individual difference measure from a task whose variables are mediocre to begin with. Though we do not provide a comprehensive analysis of this sort, we provide a single example of the approach and hope the open access we provide to the data spurs further work.
For this example we look at the retest reliability of the three dependent measures (average accuracy, median response time, total score) from the hierarchical rule task with 360 trials. Here is a graph of how the point estimate of the retest reliability changes for each of the dependent measures when different numbers of trials are used to estimate them. While the reliability estimates for the variables are …, … and … respectively when using all of the trials, these values are independent of the number of trials used in the task for only certain variables. Based on this analysis researchers might decide to use a version of the task with fewer trials or only measures with consistently high reliability estimates.
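A useful analytic benchmark for this trial-number question is the Spearman-Brown prophecy formula, which predicts reliability after scaling the number of trials by a factor m, under the assumption that trials are parallel (exchangeable) — an assumption that learning effects and condition structure can violate, which is why the empirical subsampling is still needed. A minimal sketch (Python; the 0.70 starting reliability is a made-up value, not an estimate from our data):

```python
def spearman_brown(r, m):
    """Predicted reliability when test length is scaled by a factor m
    (m < 1 shortens the task, m > 1 lengthens it), assuming parallel trials."""
    return (m * r) / (1 + (m - 1) * r)

# If the full task yields a reliability of 0.70, halving the trial count
# is predicted to drop it to about 0.54, while doubling raises it to ~0.82
r_half = spearman_brown(0.70, 0.5)
r_double = spearman_brown(0.70, 2.0)
```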
Example task: three by two
Post process dv’s from cluster to calculate reliabilities
t1_dvs = read.csv(paste0(retest_data_path, 't1_tbt_dvs.csv'))
t2_dvs = read.csv(paste0(retest_data_path, 't2_tbt_dvs.csv'))
t1_dvs = t1_dvs %>% select(-X)
t2_dvs = t2_dvs %>% select(-X)
hr_merge = merge(t1_dvs, t2_dvs, by = c("sub_id", "breaks"))
hr_merge = hr_merge %>%
gather(key, value, -sub_id, -breaks) %>%
separate(key, c("dv", "time"), sep="\\.") %>%
mutate(time = ifelse(time == "x", 1, 2))
t1_dvs = hr_merge %>%
filter(time == 1) %>%
select(-time) %>%
spread(dv, value)
t2_dvs = hr_merge %>%
filter(time == 2) %>%
select(-time) %>%
spread(dv, value)
# calculate point estimates for reliability of each of the variables for each break
# get_icc for each break of tmp_t1_dvs and tmp_t2_dvs
trial_num_rel_df = data.frame(breaks=rep(NA, length(unique(t1_dvs$breaks))),
acc_icc=rep(NA, length(unique(t1_dvs$breaks))),
avg_rt_error_icc=rep(NA, length(unique(t1_dvs$breaks))),
std_rt_error_icc=rep(NA, length(unique(t1_dvs$breaks))),
avg_rt_icc=rep(NA, length(unique(t1_dvs$breaks))),
std_rt_icc=rep(NA, length(unique(t1_dvs$breaks))),
missed_percent_icc=rep(NA, length(unique(t1_dvs$breaks))),
cue_switch_cost_rt_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
cue_switch_cost_rt_900_icc=rep(NA, length(unique(t1_dvs$breaks))),
task_switch_cost_rt_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
task_switch_cost_rt_900_icc=rep(NA, length(unique(t1_dvs$breaks))),
cue_switch_cost_acc_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
cue_switch_cost_acc_900_icc=rep(NA, length(unique(t1_dvs$breaks))),
task_switch_cost_acc_100_icc=rep(NA, length(unique(t1_dvs$breaks))),
task_switch_cost_acc_900_icc=rep(NA, length(unique(t1_dvs$breaks))))
for(i in 1:length(unique(t1_dvs$breaks))){
cur_break = unique(t1_dvs$breaks)[i]
tmp_t1_dvs = t1_dvs %>% filter(breaks == cur_break)
tmp_t2_dvs = t2_dvs %>% filter(breaks == cur_break)
trial_num_rel_df$breaks[i] = cur_break
trial_num_rel_df$acc_icc[i] = get_icc("acc", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$avg_rt_error_icc[i] = get_icc("avg_rt_error", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$std_rt_error_icc[i] = get_icc("std_rt_error", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$avg_rt_icc[i] = get_icc("avg_rt", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$std_rt_icc[i] = get_icc("std_rt", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$missed_percent_icc[i] = get_icc("missed_percent", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$cue_switch_cost_rt_100_icc[i] = get_icc("cue_switch_cost_rt_100", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$cue_switch_cost_rt_900_icc[i] = get_icc("cue_switch_cost_rt_900", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$task_switch_cost_rt_100_icc[i] = get_icc("task_switch_cost_rt_100", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$task_switch_cost_rt_900_icc[i] = get_icc("task_switch_cost_rt_900", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$cue_switch_cost_acc_100_icc[i] = get_icc("cue_switch_cost_acc_100", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$cue_switch_cost_acc_900_icc[i] = get_icc("cue_switch_cost_acc_900", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$task_switch_cost_acc_100_icc[i] = get_icc("task_switch_cost_acc_100", tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$task_switch_cost_acc_900_icc[i] = get_icc("task_switch_cost_acc_900", tmp_t1_dvs, tmp_t2_dvs)
}
rm(i, cur_break, tmp_t1_dvs, tmp_t2_dvs)
trial_num_rel_df$breaks = as.numeric(trial_num_rel_df$breaks)
write.csv(trial_num_rel_df, paste0(retest_data_path, 'trial_num_rel_df_tbt.csv'))
# trial_num_rel_df = read.csv(paste0(retest_data_path, 'trial_num_rel_df_tbt.csv'))
cols <- c("Accuracy" = '#084594', "Average RT error" = '#99000d', "SD RT error" = '#cb181d', "Average RT correct" = '#ef3b2c', "SD RT correct" = '#fb6a4a', "Missed percentage" = '#2171b5', "Cue switch cost RT (100)" = '#fc9272', "Cue switch cost RT (900)" = '#fcbba1', "Task switch cost RT (100)" = '#fee0d2', "Task switch cost RT (900)" = '#fff5f0', "Cue switch cost Acc (100)" = '#4292c6', "Cue switch cost Acc (900)" = '#6baed6', "Task switch cost Acc (100)" = '#9ecae1', "Task switch cost Acc (900)" = '#c6dbef')
trial_num_rel_df %>%
gather(key, value, -breaks) %>%
ggplot(aes(breaks*10, value, color=factor(key, levels = c("acc_icc", "avg_rt_error_icc", "std_rt_error_icc", "avg_rt_icc", "std_rt_icc", "missed_percent_icc", "cue_switch_cost_rt_100_icc", "cue_switch_cost_rt_900_icc", "task_switch_cost_rt_100_icc", "task_switch_cost_rt_900_icc", "cue_switch_cost_acc_100_icc", "cue_switch_cost_acc_900_icc", "task_switch_cost_acc_100_icc", "task_switch_cost_acc_900_icc"), labels = c("Accuracy", "Average RT error", "SD RT error", "Average RT correct", "SD RT correct", "Missed percentage", "Cue switch cost RT (100)", "Cue switch cost RT (900)", "Task switch cost RT (100)", "Task switch cost RT (900)", "Cue switch cost Acc (100)", "Cue switch cost Acc (900)", "Task switch cost Acc (100)", "Task switch cost Acc (900)"))))+
geom_point()+
geom_line()+
theme_bw()+
xlab("Number of trials")+
ylab("ICC")+
theme(legend.title = element_blank(),
legend.position = 'bottom')+
scale_color_manual(values = cols,
breaks = c("Average RT error", "Accuracy", "SD RT error", "Missed percentage", "Average RT correct", "Cue switch cost Acc (100)", "SD RT correct", "Cue switch cost Acc (900)", "Cue switch cost RT (100)", "Task switch cost Acc (100)", "Task switch cost RT (100)", "Task switch cost Acc (900)", "Task switch cost RT (900)"))+
ylim(-1,1)+
guides(color = guide_legend(ncol=2, byrow=T))
ggsave('Intrameasure_Trialnum_Dependendence.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures", width = 10, height = 5, units = "in", dpi = 100)
Checking DDM results in the bootstrapped estimates. Variables using all trials are significantly more reliable compared to difference scores. Raw measures don’t differ from DDM parameters. Which DDM is better depends on whether all trials are used.
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(ddm_task == 1,
overall_difference != "condition",
rt_acc != 'other') %>%
drop_na() %>%
left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv')
tmp %>%
drop_na() %>% #try removing this in final release
group_by(overall_difference, raw_fit, rt_acc) %>%
summarise(icc_median = quantile(icc, probs = 0.5),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
spearman_median = quantile(spearman, probs = 0.5),
spearman_2.5 = quantile(spearman, probs = 0.025),
spearman_97.5 = quantile(spearman, probs = 0.975),
num_vars = n()/1000) %>%
datatable() %>%
formatRound(columns=c('icc_median', 'icc_2.5', 'icc_97.5', 'spearman_median', 'spearman_2.5', 'spearman_97.5'), digits=3)
tmp_save = tmp %>%
drop_na() %>% #try removing this in final release
group_by(overall_difference, raw_fit, rt_acc) %>%
summarise(icc_median = quantile(icc, probs = 0.5),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
num_vars = n()/1000) %>%
ungroup() %>%
mutate(overall_difference = as.character(overall_difference),
raw_fit = as.character(raw_fit),
rt_acc = as.character(rt_acc)) %>%
arrange(-icc_median)
sjt.df(tmp_save %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/ddm_rel_table.doc")
| overall_difference | raw_fit | rt_acc | icc_median | icc_2.5 | icc_97.5 | num_vars |
|---|---|---|---|---|---|---|
| overall | raw | rt | 0.736 | 0.535 | 0.854 | 16 |
| overall | EZ | drift rate | 0.724 | 0.528 | 0.826 | 14 |
| overall | raw | accuracy | 0.684 | -0.375 | 0.832 | 14 |
| overall | hddm | drift rate | 0.684 | 0.387 | 0.816 | 16 |
| overall | EZ | non-decision | 0.647 | 0.197 | 0.794 | 14 |
| overall | EZ | threshold | 0.624 | 0.384 | 0.823 | 14 |
| overall | hddm | threshold | 0.544 | 0.316 | 0.668 | 14 |
| overall | hddm | non-decision | 0.449 | -0.032 | 0.671 | 14 |
| difference | hddm | drift rate | 0.404 | -0.242 | 0.651 | 18 |
| difference | raw | rt | 0.299 | -0.193 | 0.63 | 25 |
| difference | raw | accuracy | 0.27 | -0.245 | 0.698 | 13 |
| difference | EZ | drift rate | 0.248 | -0.196 | 0.579 | 17 |
| difference | EZ | non-decision | 0.236 | -0.204 | 0.652 | 17 |
| difference | EZ | threshold | 0.097 | -0.449 | 0.537 | 17 |
| difference | hddm | threshold | 0.09 | -0.201 | 0.345 | 1 |
Comparing overall vs difference: overall has higher reliability than difference.
summary(lmerTest::lmer(icc ~ overall_difference + (1|dv) ,tmp))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ overall_difference + (1 | dv)
Data: tmp
REML criterion at convergence: -401676
Scaled residuals:
Min 1Q Median 3Q Max
-11.245 -0.467 0.041 0.521 6.005
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.03486 0.1867
Residual 0.00966 0.0983
Number of obs: 224000, groups: dv, 224
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.245 0.018 221.400 13.6 <2e-16 ***
overall_differenceoverall 0.368 0.025 221.400 14.8 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
ovrll_dffrn -0.720
“consistent with the summing of variance in the difference score”
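The quoted point — a difference score sums the error variances of its two components while subtracting out their shared true-score variance — can be illustrated with a small simulation (Python, with made-up variances rather than our task data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# Hypothetical true scores for two highly correlated task conditions
shared = rng.normal(0, 10, n)
true_a = shared + rng.normal(0, 3, n)
true_b = shared + rng.normal(0, 3, n)

def observe(true_score):
    """One error-laden measurement of a true score."""
    return true_score + rng.normal(0, 5, n)

def retest_r(x1, x2):
    return float(np.corrcoef(x1, x2)[0, 1])

# Retest reliability of a single condition vs. the condition difference
r_single = retest_r(observe(true_a), observe(true_a))
r_diff = retest_r(observe(true_a) - observe(true_b),
                  observe(true_a) - observe(true_b))
# Differencing removes the shared true variance but sums both error
# variances, so the difference score is far less reliable than either
# condition on its own
```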
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(ddm_task == 1,
overall_difference != "condition",
rt_acc != 'other') %>%
drop_na() %>%
left_join(boot_df[,c("dv", "icc", "var_subs","var_ind", "var_resid")], by = 'dv') %>%
mutate(var_subs_pct = var_subs/(var_subs+var_ind+var_resid),
var_ind_pct = var_ind/(var_subs+var_ind+var_resid),
var_resid_pct = var_resid/(var_subs+var_ind+var_resid)) %>%
select(-task, -ddm_task, -num_all_trials, -var_resid, -var_subs, -var_ind) %>%
gather(key, value, -dv, -overall_difference, -raw_fit, -rt_acc, -icc)
tmp %>%
filter(key == "var_resid_pct") %>%
ggplot(aes(raw_fit, value, fill=overall_difference))+
geom_boxplot()
summary(lmer(value ~ overall_difference*raw_fit + (1|dv), data=tmp %>% filter(key=="var_resid_pct")))
Linear mixed model fit by REML ['lmerMod']
Formula: value ~ overall_difference * raw_fit + (1 | dv)
Data: tmp %>% filter(key == "var_resid_pct")
REML criterion at convergence: -608085
Scaled residuals:
Min 1Q Median 3Q Max
-5.529 -0.543 0.020 0.601 5.547
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.00539 0.0734
Residual 0.00385 0.0620
Number of obs: 224000, groups: dv, 224
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.3437 0.0103 33.4
overall_differenceoverall -0.1594 0.0153 -10.4
raw_fithddm -0.1070 0.0197 -5.4
raw_fitraw -0.0320 0.0157 -2.0
overall_differenceoverall:raw_fithddm 0.1270 0.0253 5.0
overall_differenceoverall:raw_fitraw 0.0240 0.0236 1.0
Correlation of Fixed Effects:
(Intr) ovrll_ rw_fth rw_ftr
ovrll_dffrn -0.672
raw_fithddm -0.521 0.350
raw_fitraw -0.653 0.439 0.340
ovrll_dffrncvrll:rw_fth 0.406 -0.605 -0.780 -0.265
ovrll_dffrncvrll:rw_ftr 0.436 -0.649 -0.227 -0.668
ovrll_dffrncvrll:rw_fth
ovrll_dffrn
raw_fithddm
raw_fitraw
ovrll_dffrncvrll:rw_fth
ovrll_dffrncvrll:rw_ftr 0.392
#Simple effects
summary(lmer(value ~ overall_difference + (1|dv), data=tmp %>% filter(key=="var_resid_pct", raw_fit== "EZ")))
Linear mixed model fit by REML ['lmerMod']
Formula: value ~ overall_difference + (1 | dv)
Data: tmp %>% filter(key == "var_resid_pct", raw_fit == "EZ")
REML criterion at convergence: -241956
Scaled residuals:
Min 1Q Median 3Q Max
-5.224 -0.536 0.041 0.607 4.544
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.00448 0.0669
Residual 0.00431 0.0657
Number of obs: 93000, groups: dv, 93
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.34371 0.00938 36.7
overall_differenceoverall -0.15937 0.01395 -11.4
Correlation of Fixed Effects:
(Intr)
ovrll_dffrn -0.672
summary(lmer(value ~ overall_difference + (1|dv), data=tmp %>% filter(key=="var_resid_pct", raw_fit== "raw")))
Linear mixed model fit by REML ['lmerMod']
Formula: value ~ overall_difference + (1 | dv)
Data: tmp %>% filter(key == "var_resid_pct", raw_fit == "raw")
REML criterion at convergence: -184635
Scaled residuals:
Min 1Q Median 3Q Max
-4.952 -0.522 0.021 0.590 4.769
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.00525 0.0724
Residual 0.00385 0.0620
Number of obs: 68000, groups: dv, 68
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.3117 0.0118 26.51
overall_differenceoverall -0.1354 0.0177 -7.65
Correlation of Fixed Effects:
(Intr)
ovrll_dffrn -0.664
summary(lmer(value ~ overall_difference + (1|dv), data=tmp %>% filter(key=="var_resid_pct", raw_fit== "hddm")))
Linear mixed model fit by REML ['lmerMod']
Formula: value ~ overall_difference + (1 | dv)
Data: tmp %>% filter(key == "var_resid_pct", raw_fit == "hddm")
REML criterion at convergence: -183232
Scaled residuals:
Min 1Q Median 3Q Max
-5.198 -0.592 -0.008 0.605 6.113
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.00689 0.0830
Residual 0.00317 0.0563
Number of obs: 63000, groups: dv, 63
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.2368 0.0190 12.43
overall_differenceoverall -0.0324 0.0228 -1.42
Correlation of Fixed Effects:
(Intr)
ovrll_dffrn -0.836
Checking Rogosa’s claim that ‘Difference scores are reliable when individual differences in true change exist.’ How do we measure ‘individual differences in true change’?
1. Between-subject variability in the difference score distributions
2. Within-subject variance in the ICC
We use option 2 (because comparing between-subject variance across different measures would be hard). Expecting to see: a positive correlation between mean icc and mean var_ind_pct. Result: the correlation between mean icc and mean var_ind_pct is NOT significant, BUT looking at the distribution of mean var_ind_pct there is not a lot of variability.
tmp = tmp %>%
spread(key, value) %>%
group_by(dv) %>%
summarise(mean_var_ind_pct = mean(var_ind_pct),
mean_var_subs_pct = mean(var_subs_pct),
mean_icc = mean(icc)) %>%
left_join(measure_labels, by='dv')
Warning: Column `dv` joining character vector and factor, coercing into
character vector
tmp %>%
filter(overall_difference == 'difference') %>%
ggplot(aes(mean_var_ind_pct, mean_icc))+
geom_point()+
geom_abline(slope=1, intercept=0)+
geom_smooth(method = "lm")
summary(lm(mean_icc ~ mean_var_ind_pct,data=tmp %>%
filter(overall_difference == 'difference')))
Call:
lm(formula = mean_icc ~ mean_var_ind_pct, data = tmp %>% filter(overall_difference ==
"difference"))
Residuals:
Min 1Q Median 3Q Max
-0.6262 -0.1294 -0.0076 0.1613 0.3764
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2302 0.0366 6.29 7.1e-09 ***
mean_var_ind_pct 0.0587 0.1197 0.49 0.62
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.212 on 106 degrees of freedom
Multiple R-squared: 0.00227, Adjusted R-squared: -0.00715
F-statistic: 0.241 on 1 and 106 DF, p-value: 0.625
tmp %>%
filter(overall_difference == 'difference') %>%
ggplot(aes(mean_var_ind_pct))+
geom_density()
summary(tmp$mean_var_ind_pct[tmp$overall_difference == "difference"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0917 0.1350 0.1820 0.2540 0.3350 0.9560
summary(tmp$mean_var_subs_pct[tmp$overall_difference == "difference"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0255 0.3690 0.4440 0.4330 0.5170 0.6300
tmp = tmp %>% filter(overall_difference == 'difference')
t.test(tmp$mean_var_ind_pct, tmp$mean_var_subs_pct, paired=T)
Paired t-test
data: tmp$mean_var_ind_pct and tmp$mean_var_subs_pct
t = -6.7, df = 110, p-value = 8e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2318 -0.1265
sample estimates:
mean of the differences
-0.1791
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(ddm_task == 1,
overall_difference != "condition",
rt_acc != 'other') %>%
drop_na() %>%
left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv')
Comparing raw vs ddm in overall estimates: EZ is significantly better than HDDM and comparable to raw estimates.
summary(lmerTest::lmer(icc ~ raw_fit + (1|dv) ,tmp %>% filter(overall_difference == "overall")))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ raw_fit + (1 | dv)
Data: tmp %>% filter(overall_difference == "overall")
REML criterion at convergence: -278565
Scaled residuals:
Min 1Q Median 3Q Max
-11.366 -0.485 0.031 0.510 8.141
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.02299 0.1516
Residual 0.00526 0.0725
Number of obs: 116000, groups: dv, 116
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.6493 0.0234 113.4000 27.75 <2e-16 ***
raw_fithddm -0.1068 0.0327 113.4000 -3.27 0.0014 **
raw_fitraw 0.0186 0.0362 113.4000 0.51 0.6080
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) rw_fth
raw_fithddm -0.715
raw_fitraw -0.645 0.462
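For reference, the EZ-diffusion point estimates compared here come from the closed-form expressions of Wagenmakers, van der Maas and Grasman (2007), which map each condition's accuracy, RT variance and mean RT onto drift rate, boundary separation and non-decision time. A minimal sketch (the function name is hypothetical; s = 0.1 is the conventional scaling parameter, and accuracies of exactly 0, 0.5 or 1 require an edge correction before use):

```r
# Closed-form EZ-diffusion estimates (Wagenmakers et al., 2007).
# Pc = proportion correct, VRT = variance of correct RTs (s^2),
# MRT = mean correct RT (s); s = 0.1 is the conventional scaling parameter.
ez_diffusion <- function(Pc, VRT, MRT, s = 0.1) {
  L <- qlogis(Pc)                                      # logit of accuracy
  x <- L * (L * Pc^2 - L * Pc + Pc - 0.5) / VRT
  v <- sign(Pc - 0.5) * s * x^(1/4)                    # drift rate
  a <- s^2 * L / v                                     # boundary separation
  y <- -v * a / s^2
  MDT <- (a / (2 * v)) * (1 - exp(y)) / (1 + exp(y))   # mean decision time
  c(drift = v, threshold = a, non_decision = MRT - MDT)
}
```

For plausible inputs (e.g., `ez_diffusion(0.8, 0.1, 0.5)`) this returns a positive drift rate and threshold, with a non-decision time below the mean RT.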
Comparing raw vs. DDM measures in difference scores: ICCs for EZ-diffusion parameters are significantly lower than for hierarchical (HDDM) estimates and marginally lower than for raw estimates (p = .055).
summary(lmerTest::lmer(icc ~ raw_fit + (1|dv) ,tmp %>% filter(overall_difference == "difference")))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ raw_fit + (1 | dv)
Data: tmp %>% filter(overall_difference == "difference")
REML criterion at convergence: -150642
Scaled residuals:
Min 1Q Median 3Q Max
-9.214 -0.535 0.065 0.615 4.172
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.0423 0.206
Residual 0.0144 0.120
Number of obs: 108000, groups: dv, 108
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.1900 0.0288 105.1000 6.59 1.8e-09 ***
raw_fithddm 0.1425 0.0553 105.1000 2.58 0.011 *
raw_fitraw 0.0855 0.0441 105.1000 1.94 0.055 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) rw_fth
raw_fithddm -0.521
raw_fitraw -0.653 0.340
tmp %>%
ggplot(aes(factor(raw_fit, levels = c("raw", "EZ", "hddm"), labels=c("Raw", "EZ-diffusion", "Hierarchical diffusion")), icc, fill=factor(rt_acc, levels = c("rt","accuracy", "drift rate", "threshold", "non-decision"), labels=c("Response Time", "Accuracy","Drift Rate", "Threshold", "Non-decision"))))+
geom_boxplot()+
facet_wrap(~factor(overall_difference, levels=c("overall", "difference"), labels=c("Overall", "Difference")))+
theme_bw()+
ylab("ICC")+
xlab("")+
theme(legend.title = element_blank(),
legend.position = 'bottom',
legend.text = element_text(size=16),
strip.text = element_text(size=16),
axis.text = element_text(size = 16),
text = element_text(size=16))+
guides(fill = guide_legend(ncol = 2))+
scale_fill_discrete(breaks=c("Drift Rate", "Threshold", "Non-decision", "Response Time", "Accuracy"))
ggsave('Bootstrap_DDM_Comp.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 14, height = 10, units = "in", limitsize = FALSE)
tmp2 = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(ddm_task == 1,
overall_difference != "condition",
rt_acc != "other") %>%
drop_na() %>%
left_join(rel_df[,c("dv", "icc", "spearman")], by = 'dv')
tmp2 %>%
ggplot(aes(factor(raw_fit, levels = c("raw", "EZ", "hddm"), labels=c("Raw", "EZ-diffusion", "Hierarchical diffusion")), icc, fill=factor(rt_acc, levels = c("rt","accuracy", "drift rate", "threshold", "non-decision"), labels=c("Response Time", "Accuracy","Drift Rate", "Threshold", "Non-decision"))))+
geom_boxplot()+
facet_wrap(~factor(overall_difference, levels=c("overall", "difference"), labels=c("Overall", "Difference")))+
theme_bw()+
ylab("ICC")+
xlab("")+
theme(legend.title = element_blank(),
legend.position = 'bottom',
legend.text = element_text(size=16),
strip.text = element_text(size=16),
axis.text = element_text(size = 16),
text = element_text(size=16))+
guides(fill = guide_legend(ncol = 2))+
scale_fill_discrete(breaks=c("Drift Rate", "Threshold", "Non-decision", "Response Time", "Accuracy"))
ggsave('PointEst_DDM_Comp.jpg', device = "jpeg", path = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/figures/", width = 14, height = 10, units = "in", limitsize = FALSE)
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'survey') %>%
left_join(boot_df[,c("dv", "icc", "spearman")], by = 'dv') %>%
filter(overall_difference != 'difference', !raw_fit %in% c('EZ', 'hddm')) %>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2) %>%
group_by(task_name) %>%
summarise(median_icc = median(icc),
median_spearman = median(spearman),
icc_2.5 = quantile(icc, probs = 0.025),
icc_97.5 = quantile(icc, probs = 0.975),
min_spearman = min(spearman),
max_spearman = max(spearman),
num_measures = n()/1000,
mean_num_trials = round(mean(num_all_trials)))%>%
arrange(-median_icc, -median_spearman)
tmp %>%
datatable() %>%
formatRound(columns=c('median_spearman', 'median_icc',
'min_spearman', 'icc_2.5',
'max_spearman', 'icc_97.5'), digits=3)
tmp = tmp%>%
select(-min_spearman, -max_spearman,-median_spearman) %>%
mutate(task_name = gsub("_", " ", task_name),
task_name = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", task_name, perl=TRUE),
task_name = gsub("Survey", "", task_name))
names(tmp) = gsub("_", " ", names(tmp))
names(tmp) = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", names(tmp), perl=TRUE)
sjt.df(tmp %>% mutate_if(is.numeric, funs(round(., 3))), describe=F, hide.progress = TRUE, show.rownames = FALSE, file = "/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_Retest_Analyses/output/tables/survey_rel_table.doc")
| Survey | Median ICC | ICC 2.5% | ICC 97.5% | Num Measures | Mean Num Items |
|---|---|---|---|---|---|
| Self Regulation | 0.95 | 0.936 | 0.961 | 1 | 31 |
| Grit Scale | 0.942 | 0.924 | 0.958 | 1 | 8 |
| Brief Self Control | 0.941 | 0.92 | 0.966 | 1 | 13 |
| Ten Item Personality | 0.936 | 0.852 | 0.966 | 5 | 2 |
| Time Perspective | 0.92 | 0.865 | 0.954 | 5 | 11 |
| Five Facet Mindfulness | 0.909 | 0.828 | 0.965 | 6 | 13 |
| Bis Bas | 0.904 | 0.851 | 0.958 | 5 | 7 |
| Dospert Rt | 0.903 | 0.855 | 0.953 | 5 | 6 |
| Impulsive Venture | 0.9 | 0.815 | 0.95 | 2 | 17 |
| Sensation Seeking | 0.899 | 0.852 | 0.951 | 5 | 16 |
| Mindful Attention Awareness | 0.897 | 0.862 | 0.928 | 1 | 15 |
| Erq | 0.896 | 0.827 | 0.933 | 2 | 5 |
| Upps Impulsivity | 0.886 | 0.718 | 0.959 | 5 | 12 |
| Mpq Control | 0.879 | 0.796 | 0.95 | 1 | 24 |
| Eating | 0.872 | 0.813 | 0.914 | 4 | 9 |
| Bis11 | 0.864 | 0.63 | 0.939 | 4 | 15 |
| Dickman | 0.85 | 0.721 | 0.91 | 2 | 12 |
| Future Time Perspective | 0.844 | 0.792 | 0.903 | 1 | 10 |
| Dospert Rp | 0.842 | 0.761 | 0.901 | 5 | 6 |
| Dospert Eb | 0.8 | 0.711 | 0.868 | 5 | 6 |
| Selection Optimization Compensation | 0.799 | 0.598 | 0.898 | 4 | 12 |
| Leisure Time Activity | 0.798 | 0.731 | 0.857 | 1 | 1 |
| Theories Of Willpower | 0.794 | 0.725 | 0.86 | 1 | 12 |
Does the number of items in a survey have a significant effect on the ICC of its measures? No.
tmp = measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'survey') %>%
# left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
filter(overall_difference != 'difference', !raw_fit %in% c('EZ', 'hddm')) %>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2)
# summary(lm(icc ~ num_all_trials, data = tmp))
summary(lmerTest::lmer(icc ~ num_all_trials + (1|dv), data = tmp))
Linear mixed model fit by REML t-tests use Satterthwaite approximations
to degrees of freedom [lmerMod]
Formula: icc ~ num_all_trials + (1 | dv)
Data: tmp
REML criterion at convergence: -320933
Scaled residuals:
Min 1Q Median 3Q Max
-10.265 -0.516 -0.004 0.540 6.540
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.003327 0.0577
Residual 0.000673 0.0259
Number of obs: 72000, groups: dv, 72
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.864206 0.011510 70.400000 75.08 <2e-16 ***
num_all_trials 0.001071 0.000915 70.400000 1.17 0.25
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
num_ll_trls -0.807
measure_labels %>%
mutate(dv = as.character(dv)) %>%
filter(task == 'task') %>%
# left_join(rel_df[,c("dv", "spearman","icc")], by='dv') %>%
left_join(boot_df[,c("dv", "spearman","icc")], by='dv') %>%
filter(overall_difference != 'difference', !raw_fit %in% c('EZ', 'hddm')) %>%
separate(dv, c('task_name', 'extra_1', 'extra_2'), sep = '\\.',remove=FALSE) %>%
select(-extra_1, -extra_2) %>%
ggplot(aes(num_all_trials, icc))+
geom_point()+
geom_smooth(method="lm")+
theme_bw()+
xlab("Number of trials")+
ylab("ICC")